Deep learning

Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, semi-supervised or unsupervised.[1][2][3]

Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases superior to human experts.[4][5][6]

Deep learning models are vaguely inspired by information processing and communication patterns in biological nervous systems, yet differ in various ways from the structural and functional properties of biological brains (especially human brains), which makes them incompatible with neuroscience evidence.[7][8][9]


Definition

Deep learning is a class of machine learning algorithms that:[10]

  • use a cascade of multiple layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input.
  • learn in supervised (e.g., classification) and/or unsupervised (e.g., pattern analysis) manners.
  • learn multiple levels of representations that correspond to different levels of abstraction; the levels form a hierarchy of concepts.

Overview

Most modern deep learning models are based on an artificial neural network, although they can also include propositional formulas or latent variables organized layer-wise in deep generative models such as the nodes in deep belief networks and deep Boltzmann machines.[11]

In deep learning, each level learns to transform its input data into a slightly more abstract and composite representation. In an image recognition application, the raw input may be a matrix of pixels; the first representational layer may abstract the pixels and encode edges; the second layer may compose and encode arrangements of edges; the third layer may encode a nose and eyes; and the fourth layer may recognize that the image contains a face. Importantly, a deep learning process can learn on its own which features to place at which level. (Of course, this does not completely obviate the need for hand-tuning; for example, varying numbers of layers and layer sizes can provide different degrees of abstraction.)[1][12]
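To make the layered-transformation idea concrete, here is a minimal sketch in Python; all names and sizes are illustrative assumptions, and the random weights stand in for learned ones. Each layer applies a linear map followed by a nonlinearity, producing a more abstract representation of its input.

    import numpy as np

    rng = np.random.default_rng(0)

    def layer(x, n_out):
        """One representational layer: linear transform + nonlinearity."""
        w = rng.normal(size=(x.shape[0], n_out))  # stand-in for learned weights
        return np.tanh(w.T @ x)                   # nonlinear processing unit

    x = rng.normal(size=784)   # e.g. a flattened 28x28 pixel image
    h1 = layer(x, 256)         # after training, might encode edges
    h2 = layer(h1, 128)        # arrangements of edges
    h3 = layer(h2, 64)         # object parts such as a nose and eyes
    print(h1.shape, h2.shape, h3.shape)  # (256,) (128,) (64,)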

The "deep" in "deep learning" refers to the number of layers through which the data is transformed. More precisely, deep learning systems have a substantial credit assignment path (CAP) depth. The CAP is the chain of transformations from input to output. CAPs describe potentially causal connections between input and output. For a feedforward neural network, the depth of the CAPs is that of the network and is the number of hidden layers plus one (as the output layer is also parameterized). For recurrent neural networks, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited.[2] No universally agreed upon threshold of depth divides shallow learning from deep learning, but most researchers agree that deep learning involves CAP depth > 2. CAP of depth 2 has been shown to be a universal approximator in the sense that it can emulate any function.Modèle:Citation needed Beyond that more layers do not add to the function approximator ability of the network. Deep models (CAP > 2) are able to extract better features than shallow models and hence, extra layers help in learning features.

Deep learning architectures are often constructed with a greedy layer-by-layer method. Deep learning helps to disentangle these abstractions and pick out which features improve performance.[1]

For supervised learning tasks, deep learning methods obviate feature engineering by translating the data into compact intermediate representations akin to principal components, and derive layered structures that remove redundancy in representation.

Deep learning algorithms can be applied to unsupervised learning tasks. This is an important benefit because unlabeled data are more abundant than labeled data. Examples of deep structures that can be trained in an unsupervised manner are neural history compressors[13] and deep belief networks.[1][14]

Interpretations

Deep neural networks are generally interpreted in terms of the universal approximation theorem[15][16][17][18][19][20] or probabilistic inference.[10][11][1][2][14][21][22]

The classic universal approximation theorem concerns the capacity of feedforward neural networks with a single hidden layer of finite size to approximate continuous functions.[15][16][17][18][19] In 1989, the first proof was published by George Cybenko for sigmoid activation functions[16] and was generalised to feed-forward multi-layer architectures in 1991 by Kurt Hornik.[17]

The universal approximation theorem for deep neural networks concerns the capacity of networks with bounded width where the depth is allowed to grow. Lu et al.[20] proved that if the width of a deep neural network with ReLU activation is strictly larger than the input dimension, then the network can approximate any Lebesgue-integrable function; if the width is smaller than or equal to the input dimension, the network is not a universal approximator.

The probabilistic interpretation[21] derives from the field of machine learning. It features inference,[10][11][1][2][14][21] as well as the optimization concepts of training and testing, related to fitting and generalization, respectively. More specifically, the probabilistic interpretation considers the activation nonlinearity as a cumulative distribution function.[21] The probabilistic interpretation led to the introduction of dropout as regularizer in neural networks.[23] The probabilistic interpretation was introduced by researchers including Hopfield, Widrow and Narendra and popularized in surveys such as the one by Bishop.[24]
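A small check of the CDF reading mentioned above, assuming NumPy and SciPy are available: the sigmoid activation coincides with the cumulative distribution function of the standard logistic distribution.

    import numpy as np
    from scipy.stats import logistic

    x = np.linspace(-4, 4, 9)
    sigmoid = 1.0 / (1.0 + np.exp(-x))            # common activation nonlinearity
    print(np.allclose(sigmoid, logistic.cdf(x)))  # True: it is a CDF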

History

The term Deep Learning was introduced to the machine learning community by Rina Dechter in 1986,[25][13] and to artificial neural networks by Igor Aizenberg and colleagues in 2000, in the context of Boolean threshold neurons.[26][27]

The first general, working learning algorithm for supervised, deep, feedforward, multilayer perceptrons was published by Alexey Ivakhnenko and Lapa in 1965.[28] A 1971 paper described a deep network with 8 layers trained by the group method of data handling algorithm.[29]

Other deep learning working architectures, specifically those built for computer vision, began with the Neocognitron introduced by Kunihiko Fukushima in 1980.[30] In 1989, Yann LeCun et al. applied the standard backpropagation algorithm, which had been around as the reverse mode of automatic differentiation since 1970,[31][32][33][34] to a deep neural network with the purpose of recognizing handwritten ZIP codes on mail. While the algorithm worked, training required 3 days.[35]

By 1991 such systems were used for recognizing isolated 2-D hand-written digits, while recognizing 3-D objects was done by matching 2-D images with a handcrafted 3-D object model. Weng et al. suggested that a human brain does not use a monolithic 3-D object model, and in 1992 they published Cresceptron,[36][37][38] a method for performing 3-D object recognition in cluttered scenes. Because it directly used natural images, Cresceptron marked the beginning of general-purpose visual learning for natural 3-D worlds. Cresceptron is a cascade of layers similar to Neocognitron. But while Neocognitron required a human programmer to hand-merge features, Cresceptron learned an open number of features in each layer without supervision, where each feature is represented by a convolution kernel. Cresceptron segmented each learned object from a cluttered scene through back-analysis through the network. Max pooling, now often adopted by deep neural networks (e.g. in ImageNet tests), was first used in Cresceptron to reduce positional resolution through the cascade, mapping each 2×2 window to a single value, for better generalization.

In 1994, André de Carvalho, together with Mike Fairhurst and David Bisset, published experimental results of a multi-layer boolean neural network, also known as a weightless neural network, composed of a three-layer self-organising feature extraction neural network module (SOFT) followed by a multi-layer classification neural network module (GSN), which were independently trained. Each layer in the feature extraction module extracted features of growing complexity relative to the previous layer.[39]

In 1995, Brendan Frey demonstrated that it was possible to train (over two days) a network containing six fully connected layers and several hundred hidden units using the wake-sleep algorithm, co-developed with Peter Dayan and Hinton.[40] Many factors contributed to the slow speed, including the vanishing gradient problem analyzed in 1991 by Sepp Hochreiter.[41][42]

Simpler models that use task-specific handcrafted features such as Gabor filters and support vector machines (SVMs) were a popular choice in the 1990s and 2000s, because of artificial neural networks' (ANNs) computational cost and a lack of understanding of how the brain wires its biological networks.

Both shallow and deep learning (e.g., recurrent nets) of ANNs have been explored for many years.[43][44][45] These methods never outperformed non-uniform internal-handcrafting Gaussian mixture model/hidden Markov model (GMM-HMM) technology based on generative models of speech trained discriminatively.[46] Key difficulties have been analyzed, including diminishing gradients[41] and weak temporal correlation structure in neural predictive models.[47][48] Additional difficulties were the lack of training data and limited computing power.

Most speech recognition researchers moved away from neural nets to pursue generative modeling. An exception was at SRI International in the late 1990s. Funded by the US government's NSA and DARPA, SRI studied deep neural networks in speech and speaker recognition. Heck's speaker recognition team achieved the first significant success with deep neural networks in speech processing in the 1998 National Institute of Standards and Technology Speaker Recognition evaluation.[49] While SRI experienced success with deep neural networks in speaker recognition, they were unsuccessful in demonstrating similar success in speech recognition. The principle of elevating "raw" features over hand-crafted optimization was first explored successfully in the deep autoencoder architecture, on the "raw" spectrogram or linear filter-bank features in the late 1990s,[49] showing its superiority over the Mel-Cepstral features that contain stages of fixed transformation from spectrograms. The raw features of speech, waveforms, later produced excellent larger-scale results.[50]

Many aspects of speech recognition were taken over by a deep learning method called long short-term memory (LSTM), a recurrent neural network published by Hochreiter and Schmidhuber in 1997.[51] LSTM RNNs avoid the vanishing gradient problem and can learn "Very Deep Learning" tasks[2] that require memories of events that happened thousands of discrete time steps before, which is important for speech. In 2003, LSTM started to become competitive with traditional speech recognizers on certain tasks.[52] Later it was combined with connectionist temporal classification (CTC)[53] in stacks of LSTM RNNs.[54] In 2015, Google's speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which they made available through Google Voice Search.[55]

In 2006, publications by Geoff Hinton, Ruslan Salakhutdinov, Osindero and Teh[56][57][58] showed how a many-layered feedforward neural network could be effectively pre-trained one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine, then fine-tuning it using supervised backpropagation.[59] The papers referred to learning for deep belief nets.

Deep learning is part of state-of-the-art systems in various disciplines, particularly computer vision and automatic speech recognition (ASR). Results on commonly used evaluation sets such as TIMIT (ASR) and MNIST (image classification), as well as a range of large-vocabulary speech recognition tasks, have steadily improved.[60][61][62] Convolutional neural networks (CNNs) were superseded for ASR by CTC[53] for LSTM,[51][55][63][64][65][66][67] but are more successful in computer vision.

The impact of deep learning in industry began in the early 2000s, when CNNs already processed an estimated 10% to 20% of all the checks written in the US, according to Yann LeCun.[68] Industrial applications of deep learning to large-scale speech recognition started around 2010.

The 2009 NIPS Workshop on Deep Learning for Speech Recognition[69] was motivated by the limitations of deep generative models of speech, and the possibility that, given more capable hardware and large-scale data sets, deep neural nets (DNNs) might become practical. It was believed that pre-training DNNs using generative models of deep belief nets (DBNs) would overcome the main difficulties of neural nets.[70] However, it was discovered that replacing pre-training with large amounts of training data for straightforward backpropagation when using DNNs with large, context-dependent output layers produced error rates dramatically lower than those of the then-state-of-the-art Gaussian mixture model (GMM)/hidden Markov model (HMM) systems, and also lower than those of more advanced generative model-based systems.[60][71] The nature of the recognition errors produced by the two types of systems was characteristically different,[72][69] offering technical insights into how to integrate deep learning into the existing highly efficient, run-time speech decoding system deployed by all major speech recognition systems.[10][73][74] Analyses around 2009-2010, contrasting the GMM (and other generative speech models) with DNN models, stimulated early industrial investment in deep learning for speech recognition,[72][69] eventually leading to pervasive and dominant use in that industry. Those analyses were done with comparable performance (error rates differing by less than 1.5%) between discriminative DNNs and generative models.[60][72][70][75]

In 2010, researchers extended deep learning from TIMIT to large vocabulary speech recognition, by adopting large output layers of the DNN based on context-dependent HMM states constructed by decision trees.[76][77][78][73]

Advances in hardware enabled the renewed interest. In 2009, Nvidia was involved in what was called the “big bang” of deep learning, “as deep-learning neural networks were trained with Nvidia graphics processing units (GPUs).”[79] That year, Google Brain used Nvidia GPUs to create capable DNNs. While there, Andrew Ng determined that GPUs could increase the speed of deep-learning systems by about 100 times.[80] In particular, GPUs are well-suited for the matrix/vector math involved in machine learning.[81][82] GPUs speed up training algorithms by orders of magnitude, reducing running times from weeks to days.[83][84] Specialized hardware and algorithm optimizations can be used for efficient processing.[85]

Deep learning revolution

In 2012, a team led by Dahl won the "Merck Molecular Activity Challenge" using multi-task deep neural networks to predict the biomolecular target of one drug.[86][87] In 2014, Hochreiter's group used deep learning to detect off-target and toxic effects of environmental chemicals in nutrients, household products and drugs and won the "Tox21 Data Challenge" of NIH, FDA and NCATS.[88][89][90]

Significant additional impacts in image or object recognition were felt from 2011 to 2012. Although CNNs trained by backpropagation had been around for decades, and GPU implementations of NNs for years, including CNNs, fast implementations of CNNs with max-pooling on GPUs in the style of Ciresan and colleagues were needed to progress on computer vision.[81][82][35][91][2] In 2011, this approach achieved for the first time superhuman performance in a visual pattern recognition contest. Also in 2011, it won the ICDAR Chinese handwriting contest, and in May 2012, it won the ISBI image segmentation contest.[92] Until 2011, CNNs did not play a major role at computer vision conferences, but in June 2012, a paper by Ciresan et al. at the leading conference CVPR[4] showed how max-pooling CNNs on GPU can dramatically improve many vision benchmark records. In October 2012, a similar system by Krizhevsky et al.[5] won the large-scale ImageNet competition by a significant margin over shallow machine learning methods. In November 2012, Ciresan et al.'s system also won the ICPR contest on analysis of large medical images for cancer detection, and in the following year also the MICCAI Grand Challenge on the same topic.[93] In 2013 and 2014, the error rate on the ImageNet task using deep learning was further reduced, following a similar trend in large-scale speech recognition. The Wolfram Image Identification project publicized these improvements.[94]

Image classification was then extended to the more challenging task of generating descriptions (captions) for images, often as a combination of CNNs and LSTMs.[95][96][97][98]

Some researchers assess that the October 2012 ImageNet victory anchored the start of a "deep learning revolution" that has transformed the AI industry.[99]

In March 2019, Yoshua Bengio, Geoffrey Hinton and Yann LeCun were awarded the Turing Award for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing.

Neural networks

Artificial neural networks

Artificial neural networks (ANNs) or connectionist systems are computing systems inspired by the biological neural networks that constitute animal brains. Such systems learn (progressively improve their ability) to do tasks by considering examples, generally without task-specific programming. For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as "cat" or "no cat" and using the analytic results to identify cats in other images. They have found most use in applications difficult to express with a traditional computer algorithm using rule-based programming.

An ANN is based on a collection of connected units called artificial neurons (analogous to biological neurons in a biological brain). Each connection (synapse) between neurons can transmit a signal to another neuron. The receiving (postsynaptic) neuron can process the signal(s) and then signal downstream neurons connected to it. Neurons may have state, generally represented by real numbers, typically between 0 and 1. Neurons and synapses may also have a weight that varies as learning proceeds, which can increase or decrease the strength of the signal that they send downstream.

Typically, neurons are organized in layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first (input), to the last (output) layer, possibly after traversing the layers multiple times.

The original goal of the neural network approach was to solve problems in the same way that a human brain would. Over time, attention focused on matching specific mental abilities, leading to deviations from biology such as backpropagation, or passing information in the reverse direction and adjusting the network to reflect that information.

Neural networks have been used on a variety of tasks, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games and medical diagnosis.

As of 2017, neural networks typically have a few thousand to a few million units and millions of connections. Despite this number being several orders of magnitude smaller than the number of neurons in a human brain, these networks can perform many tasks at a level beyond that of humans (e.g., recognizing faces, playing "Go"[100]).

Deep neural networks

A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the input and output layers.[11][2] The DNN finds the correct mathematical manipulation to turn the input into the output, whether it be a linear relationship or a non-linear relationship. The network moves through the layers calculating the probability of each output. For example, a DNN that is trained to recognize dog breeds will go over the given image and calculate the probability that the dog in the image is a certain breed. The user can review the results and select which probabilities the network should display (above a certain threshold, etc.) and return the proposed label. Each mathematical manipulation as such is considered a layer, and complex DNNs have many layers, hence the name "deep" networks. The goal is that eventually, the network will be trained to decompose an image into features, identify trends that exist across all samples and classify new images by their similarities without requiring human input.[101]
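A hedged sketch of the breed example above (the breed names, scores and threshold are invented for illustration): the network's final-layer scores are converted to probabilities with a softmax, and only those above the user's chosen threshold are displayed.

    import numpy as np

    def softmax(z):
        z = z - z.max()              # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    breeds = ["beagle", "collie", "poodle", "terrier"]
    logits = np.array([2.1, 0.3, 1.7, -0.5])  # hypothetical network outputs
    probs = softmax(logits)

    threshold = 0.25                 # user-chosen display cutoff
    for breed, p in zip(breeds, probs):
        if p >= threshold:
            print(f"{breed}: {p:.2f}")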

DNNs can model complex non-linear relationships. DNN architectures generate compositional models where the object is expressed as a layered composition of primitives.[102] The extra layers enable composition of features from lower layers, potentially modeling complex data with fewer units than a similarly performing shallow network.[11]

Deep architectures include many variants of a few basic approaches. Each architecture has found success in specific domains. It is not always possible to compare the performance of multiple architectures, unless they have been evaluated on the same data sets.

DNNs are typically feedforward networks in which data flows from the input layer to the output layer without looping back. At first, the DNN creates a map of virtual neurons and assigns random numerical values, or "weights", to connections between them. The weights and inputs are multiplied and return an output between 0 and 1. If the network didn’t accurately recognize a particular pattern, an algorithm would adjust the weights.[103] That way the algorithm can make certain parameters more influential, until it determines the correct mathematical manipulation to fully process the data.
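A minimal sketch of that procedure for a single logistic neuron, under simplifying assumptions (one training pattern, cross-entropy gradient, made-up numbers): random initial weights, inputs multiplied through to an output between 0 and 1, and repeated weight adjustments while the output is wrong.

    import numpy as np

    rng = np.random.default_rng(1)
    w = rng.normal(scale=0.1, size=3)   # random initial "weights"
    x = np.array([0.5, -1.0, 2.0])      # one input pattern
    target = 1.0                        # the output it should produce

    for step in range(100):
        out = 1.0 / (1.0 + np.exp(-(w @ x)))  # output between 0 and 1
        grad = (out - target) * x             # cross-entropy gradient
        w -= 0.5 * grad                       # adjust the weights
    print(round(float(out), 3))               # close to 1.0 after training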

Recurrent neural networks (RNNs), in which data can flow in any direction, are used for applications such as language modeling.[104][105][106][107][108] Long short-term memory is particularly effective for this use.[51][109]

Convolutional deep neural networks (CNNs) are used in computer vision.[110] CNNs also have been applied to acoustic modeling for automatic speech recognition (ASR).[67]

Challenges

As with ANNs, many issues can arise with naively trained DNNs. Two common issues are overfitting and computation time.

DNNs are prone to overfitting because of the added layers of abstraction, which allow them to model rare dependencies in the training data. Regularization methods such as Ivakhnenko's unit pruning[29] or weight decay (ℓ2 regularization) or sparsity (ℓ1 regularization) can be applied during training to combat overfitting.[111] Alternatively dropout regularization randomly omits units from the hidden layers during training. This helps to exclude rare dependencies.[112] Finally, data can be augmented via methods such as cropping and rotating such that smaller training sets can be increased in size to reduce the chances of overfitting.[113]
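Hedged sketches of the ideas above, with arbitrary coefficients and sizes: an ℓ2 (weight decay) penalty and an ℓ1 (sparsity) penalty added to the loss, and dropout's random omission of hidden units (here in the common "inverted dropout" form).

    import numpy as np

    rng = np.random.default_rng(0)
    weights = rng.normal(size=100)              # a layer's weights

    l2_penalty = 0.01 * np.sum(weights ** 2)    # weight decay loss term
    l1_penalty = 0.01 * np.sum(np.abs(weights)) # sparsity-inducing loss term

    hidden = rng.normal(size=100)    # activations of a hidden layer
    keep = rng.random(100) > 0.5     # omit each unit with probability 0.5
    dropped = hidden * keep / 0.5    # rescale so expected activation is kept
    print(l2_penalty, l1_penalty)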

DNNs must consider many training parameters, such as the size (number of layers and number of units per layer), the learning rate, and initial weights. Sweeping through the parameter space for optimal parameters may not be feasible due to the cost in time and computational resources. Various tricks, such as batching (computing the gradient on several training examples at once rather than on individual examples),[114] speed up computation. The large processing capabilities of many-core architectures (such as GPUs or the Intel Xeon Phi) have produced significant speedups in training, because of the suitability of such processing architectures for the matrix and vector computations.[115][116]
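A small sketch of batching under toy assumptions (a linear model and random data): the gradient is averaged over a whole batch with a single matrix multiply, which is exactly the kind of matrix/vector work that suits GPUs.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(32, 10))      # a batch of 32 examples, 10 features
    y = rng.normal(size=32)            # their targets
    w = np.zeros(10)                   # linear model parameters

    preds = X @ w                      # whole batch in one matrix multiply
    grad = X.T @ (preds - y) / len(y)  # least-squares gradient, batch-averaged
    w -= 0.1 * grad                    # one gradient step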

Alternatively, engineers may look for other types of neural networks with more straightforward and convergent training algorithms. CMAC (cerebellar model articulation controller) is one such kind of neural network. CMAC requires neither learning rates nor randomized initial weights. The training process can be guaranteed to converge in one step with a new batch of data, and the computational complexity of the training algorithm is linear with respect to the number of neurons involved.[117][118]

Applications

Automatic speech recognition


Large-scale automatic speech recognition is the first and most convincing successful case of deep learning. LSTM RNNs can learn "Very Deep Learning" tasks[2] that involve multi-second intervals containing speech events separated by thousands of discrete time steps, where one time step corresponds to about 10 ms. LSTM with forget gates[109] is competitive with traditional speech recognizers on certain tasks.[52]

The initial success in speech recognition was based on small-scale recognition tasks based on TIMIT. The data set contains 630 speakers from eight major dialects of American English, where each speaker reads 10 sentences.[119] Its small size lets many configurations be tried. More importantly, the TIMIT task concerns phone-sequence recognition, which, unlike word-sequence recognition, allows weak phone bigram language models. This lets the strength of the acoustic modeling aspects of speech recognition be more easily analyzed. The error rates listed below, including these early results, are measured as percent phone error rate (PER) and summarize progress going back to 1991.

Method | PER (%)
Randomly Initialized RNN[120] | 26.1
Bayesian Triphone GMM-HMM | 25.6
Hidden Trajectory (Generative) Model | 24.8
Monophone Randomly Initialized DNN | 23.4
Monophone DBN-DNN | 22.4
Triphone GMM-HMM with BMMI Training | 21.7
Monophone DBN-DNN on fbank | 20.7
Convolutional DNN[121] | 20.0
Convolutional DNN w. Heterogeneous Pooling | 18.7
Ensemble DNN/CNN/RNN[122] | 18.3
Bidirectional LSTM | 17.9
Hierarchical Convolutional Deep Maxout Network[123] | 16.5

The debut of DNNs for speaker recognition in the late 1990s, of DNNs for speech recognition around 2009-2011, and of LSTM around 2003-2007 accelerated progress in eight major areas:[10][75][73]

  • Scale-up/out and accelerated DNN training and decoding
  • Sequence discriminative training
  • Feature processing by deep models with solid understanding of the underlying mechanisms
  • Adaptation of DNNs and related deep models
  • Multi-task and transfer learning by DNNs and related deep models
  • CNNs and how to design them to best exploit domain knowledge of speech
  • RNN and its rich LSTM variants
  • Other types of deep models including tensor-based models and integrated deep generative/discriminative models.

All major commercial speech recognition systems (e.g., Microsoft Cortana, Xbox, Skype Translator, Amazon Alexa, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a range of Nuance speech products, etc.) are based on deep learning.[10][124][125][126]

Image recognition


A common evaluation set for image classification is the MNIST database. MNIST is composed of handwritten digits and includes 60,000 training examples and 10,000 test examples. As with TIMIT, its small size lets users test multiple configurations. A comprehensive list of results on this set is available.[127]

Deep learning-based image recognition has become "superhuman", producing more accurate results than human contestants. This first occurred in 2011.[128]

Deep learning-trained vehicles now interpret 360° camera views.[129] Another example is Facial Dysmorphology Novel Analysis (FDNA) used to analyze cases of human malformation connected to a large database of genetic syndromes.

Visual art processing

Closely related to the progress that has been made in image recognition is the increasing application of deep learning techniques to various visual art tasks. DNNs have proven themselves capable, for example, of a) identifying the style period of a given painting, b) Neural Style Transfer - capturing the style of a given artwork and applying it in a visually pleasing manner to an arbitrary photograph or video, and c) generating striking imagery based on random visual input fields.[130][131]

Natural language processing

Neural networks have been used for implementing language models since the early 2000s.[104][132] LSTM helped to improve machine translation and language modeling.[105][106][107]

Other key techniques in this field are negative sampling[133] and word embedding. Word embedding, such as word2vec, can be thought of as a representational layer in a deep learning architecture that transforms an atomic word into a positional representation of the word relative to other words in the dataset; the position is represented as a point in a vector space. Using word embedding as an RNN input layer allows the network to parse sentences and phrases using an effective compositional vector grammar. A compositional vector grammar can be thought of as a probabilistic context-free grammar (PCFG) implemented by an RNN.[134] Recursive auto-encoders built atop word embeddings can assess sentence similarity and detect paraphrasing.[134] Deep neural architectures provide the best results for constituency parsing,[135] sentiment analysis,[136] information retrieval,[137][138] spoken language understanding,[139] machine translation,[105][140] contextual entity linking,[140] writing style recognition,[141] text classification and others.[142]
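An illustrative sketch of the vector-space idea above. The three vectors are toy stand-ins, not real word2vec output; in a trained embedding, related words end up with high cosine similarity.

    import numpy as np

    emb = {   # toy 3-dimensional "embeddings"
        "king":  np.array([0.8, 0.3, 0.1]),
        "queen": np.array([0.7, 0.4, 0.1]),
        "apple": np.array([0.1, 0.1, 0.9]),
    }

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(emb["king"], emb["queen"]))  # high: related words
    print(cosine(emb["king"], emb["apple"]))  # low: unrelated words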

Recent developments generalize word embedding to sentence embedding.

Google Translate (GT) uses a large end-to-end long short-term memory network.[143][144][145][146][147][148] Google Neural Machine Translation (GNMT) uses an example-based machine translation method in which the system "learns from millions of examples."[144] It translates "whole sentences at a time, rather than pieces." Google Translate supports over one hundred languages.[144] The network encodes the "semantics of the sentence rather than simply memorizing phrase-to-phrase translations".[144][149] GT uses English as an intermediate between most language pairs.[149]

Drug discovery and toxicology

A large percentage of candidate drugs fail to win regulatory approval. These failures are caused by insufficient efficacy (on-target effect), undesired interactions (off-target effects), or unanticipated toxic effects.[150][151] Research has explored use of deep learning to predict the biomolecular targets,[86][87] off-targets, and toxic effects of environmental chemicals in nutrients, household products and drugs.[88][89][90]

AtomNet is a deep learning system for structure-based rational drug design.[152] AtomNet was used to predict novel candidate biomolecules for disease targets such as the Ebola virus[153] and multiple sclerosis.[154][155]

Customer relationship management

Deep reinforcement learning has been used to approximate the value of possible direct marketing actions, defined in terms of RFM variables. The estimated value function was shown to have a natural interpretation as customer lifetime value.[156]

Recommendation systems

Recommendation systems have used deep learning to extract meaningful features for a latent factor model for content-based music recommendations.[157] Multiview deep learning has been applied for learning user preferences from multiple domains.[158] The model uses a hybrid collaborative and content-based approach and enhances recommendations in multiple tasks.

Bioinformatics

An autoencoder ANN was used in bioinformatics to predict gene ontology annotations and gene-function relationships.[159]

In medical informatics, deep learning was used to predict sleep quality based on data from wearables[160] and to predict health complications from electronic health record data.[161] Deep learning has also shown efficacy in healthcare.[162]

Mobile advertising

Finding the appropriate mobile audience for mobile advertising is always challenging, since many data points must be considered and assimilated before a target segment can be created and used in ad serving by any ad server.[163] Deep learning has been used to interpret large, many-dimensioned advertising datasets. Many data points are collected during the request/serve/click internet advertising cycle. This information can form the basis of machine learning to improve ad selection.

Image restoration

Deep learning has been successfully applied to inverse problems such as denoising, super-resolution, inpainting, and film colorization. These applications include learning methods such as "Shrinkage Fields for Effective Image Restoration"[164] which trains on an image dataset, and Deep Image Prior, which trains on the image that needs restoration.

Financial fraud detection

Deep learning is being successfully applied to financial fraud detection and anti-money laundering. "Deep anti-money laundering detection system can spot and recognize relationships and similarities between data and, further down the road, learn to detect anomalies or classify and predict specific events." The solution leverages both supervised learning techniques, such as the classification of suspicious transactions, and unsupervised learning, e.g. anomaly detection.[165]

Military

The United States Department of Defense applied deep learning to train robots in new tasks through observation.[166]

Relation to human cognitive and brain development

Deep learning is closely related to a class of theories of brain development (specifically, neocortical development) proposed by cognitive neuroscientists in the early 1990s.[167][168][169][170] These developmental theories were instantiated in computational models, making them predecessors of deep learning systems. These developmental models share the property that various proposed learning dynamics in the brain (e.g., a wave of nerve growth factor) support the self-organization somewhat analogous to the neural networks utilized in deep learning models. Like the neocortex, neural networks employ a hierarchy of layered filters in which each layer considers information from a prior layer (or the operating environment), and then passes its output (and possibly the original input), to other layers. This process yields a self-organizing stack of transducers, well-tuned to their operating environment. A 1995 description stated, "...the infant's brain seems to organize itself under the influence of waves of so-called trophic-factors ... different regions of the brain become connected sequentially, with one layer of tissue maturing before another and so on until the whole brain is mature."[171]

A variety of approaches have been used to investigate the plausibility of deep learning models from a neurobiological perspective. On the one hand, several variants of the backpropagation algorithm have been proposed in order to increase its processing realism.[172][173] Other researchers have argued that unsupervised forms of deep learning, such as those based on hierarchical generative models and deep belief networks, may be closer to biological reality.[174][175] In this respect, generative neural network models have been related to neurobiological evidence about sampling-based processing in the cerebral cortex.[176]

Although a systematic comparison between the human brain organization and the neuronal encoding in deep networks has not yet been established, several analogies have been reported. For example, the computations performed by deep learning units could be similar to those of actual neurons[177][178] and neural populations.[179] Similarly, the representations developed by deep learning models are similar to those measured in the primate visual system[180] both at the single-unit[181] and at the population[182] levels.

Commercial activity

Many organizations employ deep learning for particular applications. Facebook's AI lab performs tasks such as automatically tagging uploaded pictures with the names of the people in them.[183]

Google's DeepMind Technologies developed a system capable of learning how to play Atari video games using only pixels as data input. In 2015 they demonstrated their AlphaGo system, which learned the game of Go well enough to beat a professional Go player.[184][185][186] Google Translate uses an LSTM to translate between more than 100 languages.

In 2015, Blippar demonstrated a mobile augmented reality application that uses deep learning to recognize objects in real time.[187]

As of 2008,[188] researchers at The University of Texas at Austin (UT) developed a machine learning framework called Training an Agent Manually via Evaluative Reinforcement, or TAMER, which proposed new methods for robots or computer programs to learn how to perform tasks by interacting with a human instructor.[166]

Building on TAMER, a new algorithm called Deep TAMER was later introduced in 2018 during a collaboration between the U.S. Army Research Laboratory (ARL) and UT researchers. Deep TAMER used deep learning to provide a robot the ability to learn new tasks through observation.[166]

Using Deep TAMER, a robot learned a task with a human trainer, watching video streams or observing a human perform a task in-person. The robot later practiced the task with the help of some coaching from the trainer, who provided feedback such as “good job” and “bad job.”[189]

Criticism and comment

Deep learning has attracted both criticism and comment, in some cases from outside the field of computer science.

Theory

A main criticism concerns the lack of theory surrounding some methods.[190] Learning in the most common deep architectures is implemented using well-understood gradient descent. However, the theory surrounding other algorithms, such as contrastive divergence, is less clear (e.g., does it converge? If so, how fast? What is it approximating?). Deep learning methods are often looked at as a black box, with most confirmations done empirically, rather than theoretically.[191]

Others point out that deep learning should be looked at as a step towards realizing strong AI, not as an all-encompassing solution. Despite the power of deep learning methods, they still lack much of the functionality needed for realizing this goal entirely. Research psychologist Gary Marcus noted:
"Realistically, deep learning is only part of the larger challenge of building intelligent machines. Such techniques lack ways of representing causal relationships (...) have no obvious ways of performing logical inferences, and they are also still a long way from integrating abstract knowledge, such as information about what objects are, what they are for, and how they are typically used. The most powerful A.I. systems, like Watson (...) use techniques like deep learning as just one element in a very complicated ensemble of techniques, ranging from the statistical technique of Bayesian inference to deductive reasoning."[192]
As an alternative to this emphasis on the limits of deep learning, one author speculated that it might be possible to train a machine vision stack to perform the sophisticated task of discriminating between "old master" and amateur figure drawings, and hypothesized that such a sensitivity might represent the rudiments of a non-trivial machine empathy.[193] This same author proposed that this would be in line with anthropology, which identifies a concern with aesthetics as a key element of behavioral modernity.[194]

In further reference to the idea that artistic sensitivity might inhere within relatively low levels of the cognitive hierarchy, a published series of graphic representations of the internal states of deep (20-30 layers) neural networks attempting to discern within essentially random data the images on which they were trained[195] demonstrated a visual appeal: the original research notice received well over 1,000 comments, and was the subject of what was for a time the most frequently accessed article on The Guardian's web site.[196]

Errors

Some deep learning architectures display problematic behaviors,[197] such as confidently classifying unrecognizable images as belonging to a familiar category of ordinary images[198] and misclassifying minuscule perturbations of correctly classified images.[199] Goertzel hypothesized that these behaviors are due to limitations in their internal representations and that these limitations would inhibit integration into heterogeneous multi-component artificial general intelligence (AGI) architectures.[197] These issues may possibly be addressed by deep learning architectures that internally form states homologous to image-grammar[200] decompositions of observed entities and events.[197] Learning a grammar (visual or linguistic) from training data would be equivalent to restricting the system to commonsense reasoning that operates on concepts in terms of grammatical production rules and is a basic goal of both human language acquisition[201] and artificial intelligence (AI).[202]

Cyberthreat

As deep learning moves from the lab into the world, research and experience shows that artificial neural networks are vulnerable to hacks and deception. By identifying patterns that these systems use to function, attackers can modify inputs to ANNs in such a way that the ANN finds a match that human observers would not recognize. For example, an attacker can make subtle changes to an image such that the ANN finds a match even though the image looks to a human nothing like the search target. Such a manipulation is termed an “adversarial attack.” In 2016 researchers used one ANN to doctor images in trial and error fashion, identify another's focal points and thereby generate images that deceived it. The modified images looked no different to human eyes. Another group showed that printouts of doctored images then photographed successfully tricked an image classification system.[203] One defense is reverse image search, in which a possible fake image is submitted to a site such as TinEye that can then find other instances of it. A refinement is to search using only parts of the image, to identify images from which that piece may have been taken.[204]
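A hedged, minimal sketch of the idea behind such manipulations, in the spirit of the fast gradient sign method; the tiny linear scorer and all numbers are invented stand-ins for a real ANN. Each input feature is nudged, within a small budget, in whichever direction hurts the model most.

    import numpy as np

    w = np.array([1.0, -2.0, 0.5])   # stand-in model parameters
    x = np.array([0.2, 0.1, 0.4])    # a correctly handled input
    score = w @ x                    # model's confidence in its match

    eps = 0.05                       # perturbation budget (imperceptible)
    x_adv = x - eps * np.sign(w)     # gradient of score w.r.t. x is w
    print(score, w @ x_adv)          # score drops; input barely changed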

Another group showed that certain psychedelic spectacles could fool a facial recognition system into thinking ordinary people were celebrities, potentially allowing one person to impersonate another. In 2017 researchers added stickers to stop signs and caused an ANN to misclassify them.[203]

ANNs can however be further trained to detect attempts at deception, potentially leading attackers and defenders into an arms race similar to the kind that already defines the malware defense industry. ANNs have been trained to defeat ANN-based anti-malware software by repeatedly attacking a defense with malware that was continually altered by a genetic algorithm until it tricked the anti-malware while retaining its ability to damage the target.[203]

Another group demonstrated that certain sounds could make the Google Now voice command system open a particular web address that would download malware.[203]

In “data poisoning”, false data is continually smuggled into a machine learning system’s training set to prevent it from achieving mastery.[203]

Neural network software

Neural network software is used to simulate, research, develop, and apply artificial neural networks, software concepts adapted from biological neural networks, and in some cases, a wider array of adaptive systems such as artificial intelligence and machine learning.

Simulators

Neural network simulators are software applications that are used to simulate the behavior of artificial or biological neural networks. They focus on one or a limited number of specific types of neural networks. They are typically stand-alone and not intended to produce general neural networks that can be integrated in other software. Simulators usually have some form of built-in visualization to monitor the training process. Some simulators also visualize the physical structure of the neural network.

Research simulators

[Figure: SNNS research neural network simulator]

Historically, the most common type of neural network software was intended for researching neural network structures and algorithms. The primary purpose of this type of software is, through simulation, to gain a better understanding of the behavior and the properties of neural networks. Today in the study of artificial neural networks, simulators have largely been replaced by more general component based development environments as research platforms.

Commonly used artificial neural network simulators include the Stuttgart Neural Network Simulator (SNNS), Emergent and Neural Lab.

In the study of biological neural networks, however, simulation software is still the only available approach. In such simulators the physical, biological and chemical properties of neural tissue, as well as the electromagnetic impulses between the neurons, are studied.

Commonly used biological network simulators include Neuron, GENESIS, NEST and Brian.

Data analysis simulators

Unlike the research simulators, data analysis simulators are intended for practical applications of artificial neural networks. Their primary focus is on data mining and forecasting. Data analysis simulators usually have some form of preprocessing capabilities. Unlike the more general development environments, data analysis simulators use a relatively simple static neural network that can be configured. A majority of the data analysis simulators on the market use backpropagating networks or self-organizing maps as their core. The advantage of this type of software is that it is relatively easy to use. Neural Designer is one example of a data analysis simulator.

Simulators for teaching neural network theory

When the Parallel Distributed Processing volumes[205][206][207] were released in 1986-87, they provided some relatively simple software. The original PDP software did not require any programming skills, which led to its adoption by a wide variety of researchers in diverse fields. The original PDP software was developed into a more powerful package called PDP++, which in turn has become an even more powerful platform called Emergent. With each development, the software has become more powerful, but also more daunting for use by beginners.

In 1997, the tLearn software was released to accompany a book.[208] This was a return to the idea of providing a small, user-friendly simulator that was designed with the novice in mind. tLearn allowed basic feed forward networks, along with simple recurrent networks, both of which can be trained by the simple back propagation algorithm. tLearn has not been updated since 1999.

In 2011, the Basic Prop simulator was released. Basic Prop is a self-contained application, distributed as a platform neutral JAR file, that provides much of the same simple functionality as tLearn.

In 2012, Wintempla included a namespace called NN with a set of C++ classes to implement feed forward networks, probabilistic neural networks and Kohonen networks. Neural Lab is based on Wintempla classes. The Neural Lab and Wintempla tutorials explain some of these classes for neural networks. The main disadvantage of Wintempla is that it compiles only with Microsoft Visual Studio.

Development environments

Development environments for neural networks differ from the software described above primarily on two accounts – they can be used to develop custom types of neural networks and they support deployment of the neural network outside the environment. In some cases they have advanced preprocessing, analysis and visualization capabilities.[209]

Component based

[Figure: Peltarion Synapse component based development environment]

A more modern type of development environment, currently favored in both industrial and scientific use, is based on a component-based paradigm. The neural network is constructed by connecting adaptive filter components in a pipe-and-filter flow. This allows for greater flexibility, as custom networks can be built, as well as custom components used by the network. In many cases this allows a combination of adaptive and non-adaptive components to work together. The data flow is controlled by a control system, which is exchangeable, as are the adaptation algorithms. The other important feature is deployment capability.

With the advent of component-based frameworks such as .NET and Java, component based development environments are capable of deploying the developed neural network to these frameworks as inheritable components. In addition some software can also deploy these components to several platforms, such as embedded systems.

Component based development environments include: Peltarion Synapse, NeuroDimension NeuroSolutions, Scientific Software Neuro Laboratory, and the LIONsolver integrated software. Free open source component based environments include Encog and Neuroph.

Criticism

A disadvantage of component-based development environments is that they are more complex than simulators. They require more learning to fully operate and are more complicated to develop.

Custom neural networks

The majority of available neural network implementations are, however, custom implementations in various programming languages and on various platforms. Basic types of neural networks are simple to implement directly. There are also many programming libraries that contain neural network functionality and that can be used in custom implementations (such as tensorflow, theano, etc., typically providing bindings to languages such as python, C++, Java).

Standards

In order for neural network models to be shared by different applications, a common language is necessary. The Predictive Model Markup Language (PMML) has been proposed to address this need. PMML is an XML-based language which provides a way for applications to define and share neural network models (and other data mining models) between PMML compliant applications.

PMML provides applications a vendor-independent method of defining models so that proprietary issues and incompatibilities are no longer a barrier to the exchange of models between applications. It allows users to develop models within one vendor's application, and use other vendors' applications to visualize, analyze, evaluate or otherwise use the models. Previously, this was very difficult, but with PMML, the exchange of models between compliant applications is now straightforward.
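As a hedged sketch of consuming such a model with only Python's standard library: the element names (NeuralLayer, Neuron) follow the PMML specification's neural network model, and "model.pmml" is a hypothetical file exported by some PMML-compliant application.

    import xml.etree.ElementTree as ET

    tree = ET.parse("model.pmml")   # hypothetical PMML export
    # PMML tags carry an XML namespace, so match on the local name.
    for elem in tree.iter():
        if elem.tag.endswith("NeuralLayer"):
            neurons = [n for n in elem if n.tag.endswith("Neuron")]
            print(f"layer with {len(neurons)} neurons")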

PMML consumers and producers

A range of products are being offered to produce and consume PMML. This ever-growing list includes the following neural network products:

  • R: produces PMML for neural nets and other machine learning models via the package pmml.
  • SAS Enterprise Miner: produces PMML for several mining models, including neural networks, linear and logistic regression, decision trees, and other data mining models.
  • SPSS: produces PMML for neural networks as well as many other mining models.
  • STATISTICA: produces PMML for neural networks, data mining models and traditional statistical models.

See also

References

Modèle:Reflist

External links




Category:Artificial neural networks From Wikipedia, the free encyclopedia Jump to navigation Jump to search Artificial neural networks is included in the JEL classification codes as JEL: C45 Wikimedia Commons has media related to Artificial neural network. The main article for this category is Artificial neural networks.

This category are for articles about artificial neural networks (ANN). Subcategories

This category has the following 2 subcategories, out of 2 total. D

   Deep learning‎ (40 P)

N

   Neural network software‎ (27 P)

Pages in category "Artificial neural networks"

The following 156 pages are in this category, out of 156 total. This list may not reflect recent changes (learn more).


   Artificial neural network
   Types of artificial neural networks

A

   Activation function
   ADALINE
   Adaptive neuro fuzzy inference system
   Adaptive resonance theory
   AlexNet
   ALOPEX
   AlterEgo
   Artificial Intelligence System
   Artificial neuron
   Artisto
   Autoassociative memory
   Autoencoder

B

   Backpropagation
   Backpropagation through structure
   Backpropagation through time
   Bcpnn
   Bidirectional associative memory
   Bidirectional recurrent neural networks
   BigDL
   Boltzmann machine

C

   Caffe (software)
   Capsule neural network
   Catastrophic interference
   Cellular neural network
   Cerebellar model articulation controller
   CLEVER score
   CoDi
   Committee machine
   Competitive learning
   Compositional pattern-producing network
   Computational cybernetics
   Computational neurogenetic modeling
   Confabulation (neural networks)
   Connectionist temporal classification
   Convolutional Deep Belief Networks
   Convolutional neural network
   Cover's theorem

D

   Deep belief network
   Deep lambertian networks
   Deep learning
   Deeplearning4j
   Dehaene–Changeux model
   Delta rule
   DexNet
   Differentiable neural computer
   Dropout (neural networks)

E

   Early stopping
   Echo state network
   Electricity price forecasting
   The Emotion Machine
   European Neural Network Society
   Evolutionary acquisition of neural topologies
   Extension neural network
   Extreme learning machine

F

   Feed forward (control)
   Feedforward neural network
   FindFace

G

   Gated recurrent unit
   General regression neural network
   Generalized Hebbian Algorithm
   Generative adversarial network
   Generative topographic map
   Google Neural Machine Translation
   Grossberg network
   Group method of data handling
   Growing self-organizing map

H

   Hard sigmoid
   Helmholtz machine
   Hierarchical temporal memory
   Hopfield network
   Hybrid Kohonen self-organizing map
   Hybrid neural network
   Hyper basis function network
   HyperNEAT

I

   Infomax
   Instantaneously trained neural networks
   Interactive activation and competition networks
   IPO underpricing algorithm

J

   Jpred

L

   Leabra
   Learning rule
   Learning vector quantization
   Lernmatrix
   Linde–Buzo–Gray algorithm
   Liquid state machine
   Long short-term memory

M

   Memtransistor
   Modular neural network
   MoneyBee
   Multi-surface method
   Multilayer perceptron
   Multimodal learning

N

   ND4J (software)
   ND4S
   Neocognitron
   NETtalk (artificial neural network)
   Neural cryptography
   Neural gas
   Neural network software
   Neural network synchronization protocol
   Neural Networks (journal)
   Neural Turing machine
   Neuroevolution
   Neuroevolution of augmenting topologies
   Ni1000
   NVDLA

O

   Oja's rule
   OpenNN
   Optical neural network
   Oscillatory neural network
   Outstar

P

   Perceptron
   Physical neural network
   Probabilistic neural network
   Promoter based genetic algorithm
   Pulse-coupled networks

Q

   Quantum neural network
   Quickprop

R

   Radial basis function
   Radial basis function network
   Random neural network
   Rectifier (neural networks)
   Recurrent neural network
   Recursive neural network
   Relation network
   Reservoir computing
   Residual neural network
   Restricted Boltzmann machine
   Rprop

S

   Self-organizing map
   Semantic neural network
   Sentence embedding
   Siamese network
   Sigmoid function
   Softmax function
   Spiking neural network
   SqueezeNet
   Stochastic neural analog reinforcement calculator
   Stochastic neural network
   Synaptic transistor
   Synaptic weight

T

   Tensor product network
   Time aware long short-term memory
   Time delay neural network
   Triplet loss

U

   U-matrix
   U-Net
   Universal approximation theorem

V

   Vanishing gradient problem

W

   Waifu2x
   WaveNet
   Winner-take-all (computing)
   Word embedding
   Word2vec

Modèle:Machine learning bar

Fichier:Colored neural network.svg
An artificial neural network is an interconnected group of nodes, inspired by a simplification of neurons in a brain. Here, each circular node represents an artificial neuron and an arrow represents a connection from the output of one artificial neuron to the input of another.

Artificial neural networks (ANN) or connectionist systems are computing systems vaguely inspired by the biological neural networks that constitute animal brains.[210][211] The neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs.[212] Such systems "learn" to perform tasks by considering examples, generally without being programmed with any task-specific rules. For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as "cat" or "no cat" and using the results to identify cats in other images. They do this without any prior knowledge about cats, for example, that they have fur, tails, whiskers and cat-like faces. Instead, they automatically generate identifying characteristics from the learning material that they process.

An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it.

In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs. The connections between artificial neurons are called 'edges'. Artificial neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Artificial neurons may have a threshold such that the signal is only sent if the aggregate signal crosses that threshold. Typically, artificial neurons are aggregated into layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times.

The original goal of the ANN approach was to solve problems in the same way that a human brain would. However, over time, attention moved to performing specific tasks, leading to deviations from biology. Artificial neural networks have been used on a variety of tasks, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games and medical diagnosis. Modèle:Toclimit

History

Warren McCulloch and Walter Pitts[213] (1943) created a computational model for neural networks based on mathematics and algorithms called threshold logic. This model paved the way for neural network research to split into two approaches. One approach focused on biological processes in the brain while the other focused on the application of neural networks to artificial intelligence. This work led to work on nerve networks and their link to finite automata.[214]

Hebbian learning

In the late 1940s, D. O. Hebb[215] created a learning hypothesis based on the mechanism of neural plasticity that became known as Hebbian learning. Hebbian learning is unsupervised learning. This evolved into models for long-term potentiation. Researchers started applying these ideas to computational models in 1948 with Turing's B-type machines. Farley and Clark[216] (1954) first used computational machines, then called "calculators", to simulate a Hebbian network. Other neural network computational machines were created by Rochester, Holland, Habit and Duda (1956).[217] Rosenblatt[218] (1958) created the perceptron, an algorithm for pattern recognition. With mathematical notation, Rosenblatt described circuitry beyond the basic perceptron, such as the exclusive-or circuit, which could not be processed by neural networks at the time.[219] In 1959, a biological model proposed by Nobel laureates Hubel and Wiesel was based on their discovery of two types of cells in the primary visual cortex: simple cells and complex cells.[220] The first functional networks with many layers were published by Ivakhnenko and Lapa in 1965, becoming the Group Method of Data Handling.[221][28][222]

Neural network research stagnated after machine learning research by Minsky and Papert (1969),[223] who discovered two key issues with the computational machines that processed neural networks. The first was that basic perceptrons were incapable of processing the exclusive-or circuit. The second was that computers did not have enough processing power to effectively handle the work required by large neural networks. Neural network research slowed until computers achieved far greater processing power. Much of artificial intelligence had focused on high-level (symbolic) models processed by algorithms, characterized for example by expert systems with knowledge embodied in if-then rules; in the late 1980s, research expanded to low-level (sub-symbolic) machine learning, characterized by knowledge embodied in the parameters of a cognitive model.Modèle:Citation needed

Backpropagation

A key trigger for renewed interest in neural networks and learning was Werbos's (1975) backpropagation algorithm that effectively solved the exclusive-or problem by making the training of multi-layer networks feasible and efficient. Backpropagation distributed the error term back up through the layers, by modifying the weights at each node.[219]

In the mid-1980s, parallel distributed processing became popular under the name connectionism. Rumelhart and McClelland (1986) described the use of connectionism to simulate neural processes.[224]

Support vector machines and other, much simpler methods such as linear classifiers gradually overtook neural networks in machine learning popularity. However, using neural networks transformed some domains, such as the prediction of protein structures.[225][226]

In 1992, max-pooling was introduced to provide shift invariance and tolerance to deformation, aiding 3D object recognition.[36][227][38] In 2010, backpropagation training through max-pooling was accelerated by GPUs and shown to perform better than other pooling variants.[228]

The vanishing gradient problem affects many-layered feedforward networks that use backpropagation, as well as recurrent neural networks (RNNs).[229][42] As errors propagate from layer to layer, they shrink exponentially with the number of layers, impeding the tuning of neuron weights based on those errors and particularly affecting deep networks.

To overcome this problem, Schmidhuber adopted a multi-level hierarchy of networks (1992) pre-trained one level at a time by unsupervised learning and fine-tuned by backpropagation.[230] Behnke (2003) relied only on the sign of the gradient (Rprop)[231] on problems such as image reconstruction and face localization.

Hinton et al. (2006) proposed learning a high-level representation using successive layers of binary or real-valued latent variables with a restricted Boltzmann machine[232] to model each layer. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top level feature activations.[233][234] In 2012, Ng and Dean created a network that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from YouTube videos.[235]

Earlier challenges in training deep neural networks were successfully addressed with methods such as unsupervised pre-training, while available computing power increased through the use of GPUs and distributed computing. Neural networks were deployed on a large scale, particularly in image and visual recognition problems. This became known as "deep learning".Modèle:Citation needed

Hardware-based designs

Computational devices were created in CMOS, for both biophysical simulation and neuromorphic computing. Nanodevices[236] for very large scale principal components analyses and convolution may create a new class of neural computing because they are fundamentally analog rather than digital (even though the first implementations may use digital devices).[237] Ciresan and colleagues (2010)[83] in Schmidhuber's group showed that despite the vanishing gradient problem, GPUs make backpropagation feasible for many-layered feedforward neural networks.

Contests

Between 2009 and 2012, recurrent neural networks and deep feedforward neural networks developed in Schmidhuber's research group won eight international competitions in pattern recognition and machine learning.[238][239] For example, the bi-directional and multi-dimensional long short-term memory (LSTM)[240][241][242][243] of Graves et al. won three competitions in connected handwriting recognition at the 2009 International Conference on Document Analysis and Recognition (ICDAR), without any prior knowledge about the three languages to be learned.[242][241]

Ciresan and colleagues won pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition,[244] the ISBI 2012 Segmentation of Neuronal Structures in Electron Microscopy Stacks challenge[92] and others. Their neural networks were the first pattern recognizers to achieve human-competitive or even superhuman performance[245] on benchmarks such as traffic sign recognition (IJCNN 2012), or the MNIST handwritten digits problem.

Researchers demonstrated (2010) that deep neural networks interfaced to a hidden Markov model with context-dependent states that define the neural network output layer can drastically reduce errors in large-vocabulary speech recognition tasks such as voice search.

GPU-based implementations[91] of this approach won many pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition,[244] the ISBI 2012 Segmentation of neuronal structures in EM stacks challenge,[92] the ImageNet Competition[5] and others.

Deep, highly nonlinear neural architectures similar to the neocognitron[246] and the "standard architecture of vision",[247] inspired by simple and complex cells, were pre-trained by unsupervised methods by Hinton.[53][233] A team from his lab won a 2012 contest sponsored by Merck to design software to help find molecules that might identify new drugs.[248]

Convolutional networks

As of the early 2010s, the state of the art in deep learning feedforward networks alternated between convolutional layers and max-pooling layers,[91][249] topped by several fully or sparsely connected layers followed by a final classification layer. Learning is usually done without unsupervised pre-training. In the convolutional layer, there are filters that are convolved with the input. Each filter is equivalent to a weight vector that has to be trained.

Such supervised deep learning methods were the first to achieve human-competitive performance on certain tasks.[245]

Artificial neural networks were able to guarantee shift invariance for dealing with small and large natural objects in large cluttered scenes only when invariance extended beyond shift to all ANN-learned concepts, such as location, type (object class label), scale, and lighting. This was realized in Developmental Networks (DNs),[250] whose embodiments are Where-What Networks, WWN-1 (2008)[251] through WWN-7 (2013).[252]

Models


Fichier:Neuron3.png
Neuron and myelinated axon, with signal flow from inputs at dendrites to outputs at axon terminals

An artificial neural network is a network of simple elements called artificial neurons, which receive input, change their internal state (activation) according to that input, and produce output depending on the input and activation.

An artificial neuron mimics the working of a biophysical neuron with inputs and outputs, but is not a biological neuron model.

The network forms by connecting the outputs of certain neurons to the inputs of other neurons, creating a directed, weighted graph. The weights as well as the functions that compute the activation can be modified by a process called learning, which is governed by a learning rule.[253]

Components of an artificial neural network

Neurons

A neuron with label <math>j</math> receiving an input <math>p_j(t)</math> from predecessor neurons consists of the following components:[253]

  • an activation <math>a_j(t)</math>, the neuron's state, depending on a discrete time parameter,
  • possibly a threshold <math>\theta_j</math>, which stays fixed unless changed by a learning function,
  • an activation function <math>f</math> that computes the new activation at a given time <math>t+1</math> from <math>a_j(t)</math>, <math>\theta_j</math> and the net input <math>p_j(t)</math> giving rise to the relation
<math> a_j(t+1) = f(a_j(t), p_j(t), \theta_j) </math>,
  • and an output function <math>f_{out}</math> computing the output from the activation
<math> o_j(t) = f_{out}(a_j(t)) </math>.

Often the output function is simply the identity function.

An input neuron has no predecessor but serves as an input interface for the whole network. Similarly, an output neuron has no successor and thus serves as an output interface for the whole network.

Connections, weights and biases

The network consists of connections, each connection transferring the output of a neuron <math>i</math> to the input of a neuron <math>j</math>. In this sense <math>i</math> is the predecessor of <math>j</math> and <math>j</math> is the successor of <math>i</math>. Each connection is assigned a weight <math>w_{ij}</math>.[253] Sometimes a bias term is added to the total weighted sum of inputs to serve as a threshold to shift the activation function.[254]

Propagation function

The propagation function computes the input <math>p_j(t)</math> to the neuron <math>j</math> from the outputs <math>o_i(t)</math> of predecessor neurons and typically has the form[253]

<math> p_j(t) = \sum_{i} o_i(t) w_{ij} </math>.

When a bias term is added, the form becomes the following:[255]

<math> p_j(t) = \sum_{i} o_i(t) w_{ij} + w_{0j} </math>, where <math>w_{0j}</math> is the bias.
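
As an illustration, the propagation function and a simple threshold activation can be sketched in a few lines of Python; the function names and the particular choice of threshold activation here are illustrative, not prescribed by the model:

  import numpy as np

  def propagate(o, w, bias=0.0):
      # p_j(t) = sum_i o_i(t) * w_ij + w_0j
      return np.dot(o, w) + bias

  def activate(p_j, theta_j=0.0):
      # a simple threshold activation: fire only if the net input crosses theta_j
      return 1.0 if p_j >= theta_j else 0.0

  o = np.array([0.5, -1.0, 0.25])   # outputs o_i(t) of predecessor neurons
  w = np.array([0.8, 0.2, -0.5])    # connection weights w_ij
  print(activate(propagate(o, w, bias=0.1)))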

Learning rule

The learning rule is a rule or an algorithm which modifies the parameters of the neural network, in order for a given input to the network to produce a favored output. This learning process typically amounts to modifying the weights and thresholds of the variables within the network.[253]

Neural networks as functions

Modèle:See also

Neural network models can be viewed as simple mathematical models defining a function <math>\textstyle f : X \rightarrow Y </math> or a distribution over <math>\textstyle X</math> or both <math>\textstyle X</math> and <math>\textstyle Y</math>. Sometimes models are intimately associated with a particular learning rule. A common use of the phrase "ANN model" is really the definition of a class of such functions (where members of the class are obtained by varying parameters, connection weights, or specifics of the architecture such as the number of neurons or their connectivity).

Mathematically, a neuron's network function <math>\textstyle f(x)</math> is defined as a composition of other functions <math>\textstyle g_i(x)</math>, that can further be decomposed into other functions. This can be conveniently represented as a network structure, with arrows depicting the dependencies between functions. A widely used type of composition is the nonlinear weighted sum, where <math>\textstyle f (x) = K \left(\sum_i w_i g_i(x)\right) </math>, where <math>\textstyle K</math> (commonly referred to as the activation function[256]) is some predefined function, such as the hyperbolic tangent or sigmoid function or softmax function or rectifier function. The important characteristic of the activation function is that it provides a smooth transition as input values change, i.e. a small change in input produces a small change in output. The following refers to a collection of functions <math>\textstyle g_i</math> as a vector <math>\textstyle g = (g_1, g_2, \ldots, g_n)</math>.
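
A minimal sketch of such a nonlinear weighted sum, with hypothetical component functions <math>\textstyle g_i</math> and the hyperbolic tangent standing in for <math>\textstyle K</math>:

  import numpy as np

  def f(x, w, gs, K=np.tanh):
      # f(x) = K( sum_i w_i * g_i(x) )
      return K(sum(w_i * g_i(x) for w_i, g_i in zip(w, gs)))

  gs = [lambda x: x, lambda x: x ** 2]   # illustrative component functions
  print(f(0.5, w=[0.3, -0.7], gs=gs))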

Such a decomposition of <math>\textstyle f</math> can be depicted as a network structure, with arrows indicating the dependencies between variables. The structure can be interpreted in two ways.

The first view is the functional view: the input <math>\textstyle x</math> is transformed into a 3-dimensional vector <math>\textstyle h</math>, which is then transformed into a 2-dimensional vector <math>\textstyle g</math>, which is finally transformed into <math>\textstyle f</math>. This view is most commonly encountered in the context of optimization.

The second view is the probabilistic view: the random variable <math>\textstyle F = f(G) </math> depends upon the random variable <math>\textstyle G = g(H)</math>, which depends upon <math>\textstyle H=h(X)</math>, which depends upon the random variable <math>\textstyle X</math>. This view is most commonly encountered in the context of graphical models.

The two views are largely equivalent. In either case, for this particular architecture, the components of individual layers are independent of each other (e.g., the components of <math>\textstyle g</math> are independent of each other given their input <math>\textstyle h</math>). This naturally enables a degree of parallelism in the implementation.

Fichier:Recurrent ann dependency graph.png
Two separate depictions of the recurrent ANN dependency graph

Networks such as the previous one are commonly called feedforward, because their graph is a directed acyclic graph. Networks with cycles are commonly called recurrent. Such networks are commonly depicted in the manner shown at the top of the figure, where <math>\textstyle f</math> is shown as being dependent upon itself. However, an implied temporal dependence is not shown.

Learning

Modèle:See also

The possibility of learning has attracted the most interest in neural networks. Given a specific task to solve, and a class of functions <math>\textstyle F</math>, learning means using a set of observations to find <math>\textstyle f^{*} \in F</math> which solves the task in some optimal sense.

This entails defining a cost function <math>\textstyle C : F \rightarrow \mathbb{R}</math> such that, for the optimal solution <math>\textstyle f^*</math>, <math>\textstyle C(f^*) \leq C(f)</math> for all <math>\textstyle f \in F</math>; that is, no solution has a cost less than the cost of the optimal solution (see mathematical optimization).

The cost function <math>\textstyle C</math> is an important concept in learning, as it is a measure of how far away a particular solution is from an optimal solution to the problem to be solved. Learning algorithms search through the solution space to find a function that has the smallest possible cost.

For applications where the solution is data dependent, the cost must necessarily be a function of the observations, otherwise the model would not relate to the data. It is frequently defined as a statistic to which only approximations can be made. As a simple example, consider the problem of finding the model <math>\textstyle f</math>, which minimizes <math>\textstyle C=E\left[(f(x) - y)^2\right]</math>, for data pairs <math>\textstyle (x,y)</math> drawn from some distribution <math>\textstyle \mathcal{D}</math>. In practical situations we would only have <math>\textstyle N</math> samples from <math>\textstyle \mathcal{D}</math> and thus, for the above example, we would only minimize <math>\textstyle \hat{C}=\frac{1}{N}\sum_{i=1}^N (f(x_i)-y_i)^2</math>. Thus, the cost is minimized over a sample of the data rather than the entire distribution.
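
The sample-based cost above can be computed directly; the following sketch (model and data values are invented for illustration) evaluates <math>\textstyle \hat{C}</math> for a candidate <math>\textstyle f</math>:

  import numpy as np

  def empirical_cost(f, xs, ys):
      # C_hat = (1/N) * sum_i (f(x_i) - y_i)^2
      return np.mean((f(xs) - ys) ** 2)

  xs = np.array([0.0, 1.0, 2.0])
  ys = np.array([0.1, 0.9, 2.2])
  print(empirical_cost(lambda x: 1.05 * x, xs, ys))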

When <math>\textstyle N \rightarrow \infty</math> some form of online machine learning must be used, where the cost is reduced as each new example is seen. While online machine learning is often used when <math>\textstyle \mathcal{D}</math> is fixed, it is most useful in the case where the distribution changes slowly over time. In neural network methods, some form of online machine learning is frequently used for finite datasets.

Choosing a cost function

While it is possible to define an ad hoc cost function, frequently a particular cost function is used, either because it has desirable properties (such as convexity) or because it arises naturally from a particular formulation of the problem (e.g., in a probabilistic formulation the posterior probability of the model can be used as an inverse cost). Ultimately, the cost function depends on the task.

Backpropagation

Modèle:Main

A DNN can be discriminatively trained with the standard backpropagation algorithm. Backpropagation is a method to calculate the gradient of the loss function (which produces the cost associated with a given state) with respect to the weights in an ANN.

The basics of continuous backpropagation[221][257][87][258] were derived in the context of control theory by Kelley[259] in 1960 and by Bryson in 1961,[260] using principles of dynamic programming. In 1962, Dreyfus published a simpler derivation based only on the chain rule.[261] Bryson and Ho described it as a multi-stage dynamic system optimization method in 1969.[262][263] In 1970, Linnainmaa finally published the general method for automatic differentiation (AD) of discrete connected networks of nested differentiable functions.[31][264] This corresponds to the modern version of backpropagation which is efficient even when the networks are sparse.[221][257][32][265] In 1973, Dreyfus used backpropagation to adapt parameters of controllers in proportion to error gradients.[266] In 1974, Werbos mentioned the possibility of applying this principle to artificial neural networks,[267] and in 1982, he applied Linnainmaa's AD method to neural networks in the way that is widely used today.[257][34] In 1986, Rumelhart, Hinton and Williams noted that this method can generate useful internal representations of incoming data in hidden layers of neural networks.[203] In 1993, Wan was the first[221] to win an international pattern recognition contest through backpropagation.[268]

The weight updates of backpropagation can be done via stochastic gradient descent using the following equation:

<math> w_{ij}(t + 1) = w_{ij}(t) - \eta\frac{\partial C}{\partial w_{ij}} +\xi(t) </math>

where <math> \eta </math> is the learning rate, <math> C </math> is the cost (loss) function and <math>\xi(t)</math> is a stochastic term. The choice of the cost function depends on factors such as the learning type (supervised, unsupervised, reinforcement, etc.) and the activation function. For example, when performing supervised learning on a multiclass classification problem, common choices for the activation function and cost function are the softmax function and cross entropy function, respectively. The softmax function is defined as <math> p_j = \frac{\exp(x_j)}{\sum_k \exp(x_k)} </math> where <math> p_j </math> represents the class probability (output of the unit <math> j </math>) and <math> x_j </math> and <math> x_k </math> represent the total input to units <math> j </math> and <math> k </math> of the same level respectively. Cross entropy is defined as <math> C = -\sum_j d_j \log(p_j) </math> where <math> d_j </math> represents the target probability for output unit <math> j </math> and <math> p_j </math> is the probability output for <math> j </math> after applying the activation function.[269]
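
These definitions translate directly into code; the following sketch implements the softmax, the cross-entropy cost and one stochastic weight update (the learning rate and noise scale are illustrative choices):

  import numpy as np

  def softmax(x):
      # p_j = exp(x_j) / sum_k exp(x_k); shifting by max(x) avoids overflow
      e = np.exp(x - np.max(x))
      return e / e.sum()

  def cross_entropy(p, d):
      # C = -sum_j d_j * log(p_j)
      return -np.sum(d * np.log(p))

  def sgd_step(w, grad, eta=0.01, noise_scale=0.0):
      # w(t+1) = w(t) - eta * dC/dw + xi(t)
      xi = noise_scale * np.random.randn(*np.shape(w))
      return w - eta * grad + xi

  x = np.array([2.0, 1.0, 0.1])   # total inputs to the output units
  d = np.array([1.0, 0.0, 0.0])   # target probabilities
  print(cross_entropy(softmax(x), d))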

Such deep networks can be used to output object bounding boxes in the form of a binary mask. They are also used for multi-scale regression to increase localization precision. DNN-based regression can learn features that capture geometric information in addition to serving as a good classifier. They remove the requirement to explicitly model parts and their relations. This helps to broaden the variety of objects that can be learned. The model consists of multiple layers, each of which has a rectified linear unit as its activation function for non-linear transformation. Some layers are convolutional, while others are fully connected. Every convolutional layer has an additional max pooling. The network is trained to minimize L2 error for predicting the mask ranging over the entire training set containing bounding boxes represented as masks.

Alternatives to backpropagation include Extreme Learning Machines,[270] "No-prop" networks,[271] training without backtracking,[272] "weightless" networks,[273][114] and non-connectionist neural networks.

Learning paradigms

The three major learning paradigms each correspond to a particular learning task. These are supervised learning, unsupervised learning and reinforcement learning.

Supervised learning

Supervised learning uses a set of example pairs <math> (x, y), x \in X, y \in Y</math> and the aim is to find a function <math> f : X \rightarrow Y </math> in the allowed class of functions that matches the examples. In other words, we wish to infer the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data and it implicitly contains prior knowledge about the problem domain.[274]

A commonly used cost is the mean-squared error, which tries to minimize the average squared error between the network's output, <math> f(x)</math>, and the target value <math> y</math> over all the example pairs. Minimizing this cost using gradient descent for the class of neural networks called multilayer perceptrons (MLP), produces the backpropagation algorithm for training neural networks.

Tasks that fall within the paradigm of supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation). The supervised learning paradigm is also applicable to sequential data (e.g., for handwriting, speech and gesture recognition). This can be thought of as learning with a "teacher", in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.

Unsupervised learning

In unsupervised learning, some data <math>\textstyle x</math> is given together with a cost function to be minimized, which can be any function of the data <math>\textstyle x</math> and the network's output <math>\textstyle f</math>.

The cost function is dependent on the task (the model domain) and any a priori assumptions (the implicit properties of the model, its parameters and the observed variables).

As a trivial example, consider the model <math>\textstyle f(x) = a</math> where <math>\textstyle a</math> is a constant and the cost <math>\textstyle C=E[(x - f(x))^2]</math>. Minimizing this cost produces a value of <math>\textstyle a</math> that is equal to the mean of the data. The cost function can be much more complicated. Its form depends on the application: for example, in compression it could be related to the mutual information between <math>\textstyle x</math> and <math>\textstyle f(x)</math>, whereas in statistical modeling, it could be related to the posterior probability of the model given the data (note that in both of those examples those quantities would be maximized rather than minimized).

Tasks that fall within the paradigm of unsupervised learning are in general estimation problems; the applications include clustering, the estimation of statistical distributions, compression and filtering.

Reinforcement learning

Modèle:See also

In reinforcement learning, data <math>\textstyle x</math> are usually not given, but generated by an agent's interactions with the environment. At each point in time <math>\textstyle t</math>, the agent performs an action <math>\textstyle y_t</math> and the environment generates an observation <math>\textstyle x_t</math> and an instantaneous cost <math>\textstyle c_t</math>, according to some (usually unknown) dynamics. The aim is to discover a policy for selecting actions that minimizes some measure of a long-term cost, e.g., the expected cumulative cost. The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated.

More formally the environment is modeled as a Markov decision process (MDP) with states <math>\textstyle {s_1,...,s_n}\in S </math> and actions <math>\textstyle {a_1,...,a_m} \in A</math> with the following probability distributions: the instantaneous cost distribution <math>\textstyle P(c_t|s_t)</math>, the observation distribution <math>\textstyle P(x_t|s_t)</math> and the transition <math>\textstyle P(s_{t+1}|s_t, a_t)</math>, while a policy is defined as the conditional distribution over actions given the observations. Taken together, the two then define a Markov chain (MC). The aim is to discover the policy (i.e., the MC) that minimizes the cost.
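
A toy sketch of this interaction loop, under the simplifying assumption that the observation equals the state and with invented dynamics, accumulates the instantaneous costs incurred by a policy:

  import random

  def run_episode(policy, transition, cost, s0, horizon=10):
      # roll out a policy, accumulating the instantaneous costs c_t
      s, total = s0, 0.0
      for t in range(horizon):
          a = policy(s)               # draw an action from the policy
          total += cost(s, a)         # instantaneous cost c_t
          s = transition(s, a)        # sample s_{t+1} ~ P(s_{t+1} | s_t, a_t)
      return total

  # toy two-state problem: state 1 is expensive, state 0 is cheap
  policy = lambda s: random.choice([0, 1])
  transition = lambda s, a: a
  cost = lambda s, a: 1.0 if s == 1 else 0.1
  print(run_episode(policy, transition, cost, s0=0))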

Artificial neural networks are frequently used in reinforcement learning as part of the overall algorithm.[275][276] Dynamic programming was coupled with artificial neural networks (giving neurodynamic programming) by Bertsekas and Tsitsiklis[277] and applied to multi-dimensional nonlinear problems such as vehicle routing,[278] natural resources management[279][280] and medicine,[281] because artificial neural networks can mitigate losses of accuracy even when the discretization grid density is reduced for numerically approximating the solution of the original control problems.

Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision making tasks.

Learning algorithms

Modèle:See also

Training a neural network model essentially means selecting one model from the set of allowed models (or, in a Bayesian framework, determining a distribution over the set of allowed models) that minimizes the cost. Numerous algorithms are available for training neural network models; most of them can be viewed as a straightforward application of optimization theory and statistical estimation.

Most employ some form of gradient descent, using backpropagation to compute the actual gradients. This is done by simply taking the derivative of the cost function with respect to the network parameters and then changing those parameters in a gradient-related direction. Backpropagation training algorithms fall into three categories:

  • steepest descent (with variable learning rate and momentum, resilient backpropagation);
  • quasi-Newton (Broyden–Fletcher–Goldfarb–Shanno, one step secant);
  • Levenberg–Marquardt and conjugate gradient (Fletcher–Reeves update, Polak–Ribiére update, Powell–Beale restart, scaled conjugate gradient).[282]

Evolutionary methods,[283] gene expression programming,[284] simulated annealing,[285] expectation-maximization, non-parametric methods and particle swarm optimization[286] are other methods for training neural networks.

Convergent recursive learning algorithm

This is a learning method specially designed for cerebellar model articulation controller (CMAC) neural networks. In 2004, a recursive least squares algorithm was introduced to train CMAC neural networks online.[117] This algorithm can converge in one step and update all weights in one step with any new input data. Initially, this algorithm had computational complexity of <math>O(N^3)</math>. Based on QR decomposition, this recursive learning algorithm was simplified to <math>O(N)</math>.[118]

Optimization

The optimization algorithm repeats a two phase cycle, propagation and weight update. When an input vector is presented to the network, it is propagated forward through the network, layer by layer, until it reaches the output layer. The output of the network is then compared to the desired output, using a loss function. The resulting error value is calculated for each of the neurons in the output layer. The error values are then propagated from the output back through the network, until each neuron has an associated error value that reflects its contribution to the original output.

Backpropagation uses these error values to calculate the gradient of the loss function. In the second phase, this gradient is fed to the optimization method, which in turn uses it to update the weights, in an attempt to minimize the loss function.

Algorithm

Let <math>N</math> be a neural network with <math>e</math> connections, <math>m</math> inputs, and <math>n</math> outputs.

Below, <math>x_1,x_2,\dots</math> will denote vectors in <math>\mathbb{R}^m</math>, <math>y_1,y_2,\dots</math> vectors in <math>\mathbb{R}^n</math>, and <math>w_0, w_1, w_2, \ldots</math> vectors in <math>\mathbb{R}^e</math>. These are called inputs, outputs and weights respectively.

The neural network corresponds to a function <math>y = f_N(w, x)</math> which, given a weight <math>w</math>, maps an input <math>x</math> to an output <math>y</math>.

The optimization takes as input a sequence of training examples <math>(x_1,y_1), \dots, (x_p, y_p)</math> and produces a sequence of weights <math>w_0, w_1, \dots, w_p</math> starting from some initial weight <math>w_0</math>, usually chosen at random.

These weights are computed in turn: first compute <math>w_i</math> using only <math>(x_i, y_i, w_{i-1})</math> for <math>i = 1, \dots, p</math>. The output of the algorithm is then <math>w_p</math>, giving us a new function <math>x \mapsto f_N(w_p, x)</math>. The computation is the same in each step, hence only the case <math>i = 1</math> is described.

Calculating <math>w_1</math> from <math>(x_1, y_1, w_0)</math> is done by considering a variable weight <math>w</math> and applying gradient descent to the function <math>w\mapsto E(f_N(w, x_1), y_1)</math> to find a local minimum, starting at <math>w = w_0</math>.

This makes <math>w_1</math> the minimizing weight found by gradient descent.

Algorithm in code


To implement the algorithm above, explicit formulas are required for the gradient of the function <math>w \mapsto E(f_N(w, x), y)</math>, where the error function is <math>E(y,y')= |y-y'|^2</math>.

The learning algorithm can be divided into two phases: propagation and weight update.

Phase 1: propagation

Each propagation involves the following steps:

  1. Propagation forward through the network to generate the output value(s)
  2. Calculation of the cost (error term)
  3. Propagation of the output activations back through the network using the training pattern target to generate the deltas (the difference between the targeted and actual output values) of all output and hidden neurons.

Phase 2: weight update

For each weight, the following steps must be followed:

  1. The weight's output delta and input activation are multiplied to find the gradient of the weight.
  2. A ratio (percentage) of the weight's gradient is subtracted from the weight.

This ratio (percentage) influences the speed and quality of learning; it is called the learning rate. The greater the ratio, the faster the neuron trains, but the lower the ratio, the more accurate the training is. The sign of the gradient of a weight indicates whether the error varies directly with, or inversely to, the weight. Therefore, the weight must be updated in the opposite direction, "descending" the gradient.

Learning is repeated (on new batches) until the network performs adequately.

Pseudocode

The following is pseudocode for a stochastic gradient descent algorithm for training a three-layer network (only one hidden layer):

  initialize network weights (often small random values)
  do
     forEach training example named ex
        prediction = neural-net-output(network, ex)  // forward pass
        actual = teacher-output(ex)
        compute error (prediction - actual) at the output units
        compute Δw_h for all weights from hidden layer to output layer  // backward pass
        compute Δw_i for all weights from input layer to hidden layer   // backward pass continued
        update network weights // input layer not modified by error estimate
  until all examples classified correctly or another stopping criterion satisfied
  return the network

The lines labeled "backward pass" can be implemented using the backpropagation algorithm, which calculates the gradient of the error of the network regarding the network's modifiable weights.[287]
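
As a concrete counterpart to the pseudocode, the following is a minimal runnable sketch of the same loop for one hidden layer, using sigmoid activations and a squared-error cost; the layer size, learning rate and epoch count are arbitrary illustrative choices:

  import numpy as np

  def train_three_layer(X, Y, hidden=4, eta=0.5, epochs=5000, seed=0):
      # stochastic gradient descent with backpropagation for one hidden layer
      rng = np.random.default_rng(seed)
      W1 = rng.normal(0, 0.5, (X.shape[1], hidden))     # input -> hidden weights
      W2 = rng.normal(0, 0.5, (hidden, Y.shape[1]))     # hidden -> output weights
      sig = lambda z: 1.0 / (1.0 + np.exp(-z))
      for _ in range(epochs):
          for x, y in zip(X, Y):                        # one training example at a time
              h = sig(x @ W1)                           # forward pass
              o = sig(h @ W2)
              delta_o = (o - y) * o * (1 - o)           # backward pass
              delta_h = (delta_o @ W2.T) * h * (1 - h)  # backward pass continued
              W2 -= eta * np.outer(h, delta_o)          # update network weights
              W1 -= eta * np.outer(x, delta_h)
      return W1, W2

  # illustration on the exclusive-or problem
  X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
  Y = np.array([[0], [1], [1], [0]], dtype=float)
  W1, W2 = train_three_layer(X, Y)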

Extension

The choice of learning rate <math display="inline">\eta</math> is important: too high a value causes changes that are too strong, so the minimum may be missed, while too low a value slows training unnecessarily.

Optimizations such as Quickprop are primarily aimed at speeding up error minimization; other improvements mainly try to increase reliability.

Adaptive learning rate

In order to avoid oscillations inside the network, such as alternating connection weights, and to improve the rate of convergence, refinements of this algorithm use an adaptive learning rate.[288]

Inertia

By using a variable inertia term (momentum) <math display="inline">\alpha</math>, the gradient and the last change can be weighted such that the weight adjustment additionally depends on the previous change. If the momentum <math display="inline">\alpha</math> is equal to 0, the change depends solely on the gradient; with a value of 1, it depends only on the last change.

Similar to a ball rolling down a mountain, whose current speed is determined not only by the current slope of the mountain but also by its own inertia, inertia can be added:<math display="block">\Delta w_{ij} (t + 1) = (1- \alpha) \eta \delta_j o_i+\alpha\,\Delta w_{ij}(t)</math>where:

<math display="inline">\Delta w_{ij} (t + 1)</math> is the change in weight <math display="inline">w_{ij} (t + 1)</math> in the connection of neuron <math display="inline">i</math> to neuron <math display="inline">j</math> at time <math display="inline">(t + 1),</math>
<math display="inline">\eta</math> a learning rate (<math display="inline">\eta < 0),</math>
<math display="inline">\delta_j</math> the error signal of neuron <math display="inline">j</math> and
<math display="inline">o_i</math> the output of neuron <math display="inline">i</math>, which is also an input of the current neuron (neuron <math display="inline">j</math>),
<math display="inline">\alpha</math> the influence of the inertial term <math display="inline">\Delta w_{ij} (t)</math> (in <math display="inline">[0,1]</math>). This corresponds to the weight change at the previous point in time.

Inertia makes the current weight change <math display="inline">(t + 1)</math> depend both on the current gradient of the error function (slope of the mountain, 1st summand), as well as on the weight change from the previous point in time (inertia, 2nd summand).

With inertia, the problems of getting stuck (in steep ravines and flat plateaus) are avoided. Since, for example, the gradient of the error function becomes very small in flat plateaus, a plateau would immediately lead to a "deceleration" of the gradient descent. This deceleration is delayed by the addition of the inertia term so that a flat plateau can be escaped more quickly.
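
A sketch of the update rule above; the argument names are illustrative, with grad_term standing for the product <math display="inline">\delta_j o_i</math>:

  def momentum_update(delta_w_prev, grad_term, eta=0.1, alpha=0.9):
      # delta_w(t+1) = (1 - alpha) * eta * delta_j * o_i + alpha * delta_w(t)
      return (1 - alpha) * eta * grad_term + alpha * delta_w_prev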

Modes of learning

Two modes of learning are available: stochastic and batch. In stochastic learning, each input creates a weight adjustment. In batch learning, weights are adjusted based on a batch of inputs, accumulating errors over the batch. Stochastic learning introduces "noise" into the gradient descent process, using the local gradient calculated from one data point; this reduces the chance of the network getting stuck in local minima. However, batch learning typically yields a faster, more stable descent to a local minimum, since each update is performed in the direction of the batch's average error. A common compromise is to use "mini-batches": small batches, with the samples in each batch selected stochastically from the entire data set.
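
One possible way of drawing such mini-batches for a single pass over the data set is sketched below; the batch size is a tuning choice:

  import numpy as np

  def minibatches(X, Y, batch_size, rng):
      # shuffle once per epoch, then yield small stochastically selected batches
      idx = rng.permutation(len(X))
      for start in range(0, len(X), batch_size):
          sel = idx[start:start + batch_size]
          yield X[sel], Y[sel]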

Variants

Group method of data handling

Modèle:Main

The Group Method of Data Handling (GMDH)[289] features fully automatic structural and parametric model optimization. The node activation functions are Kolmogorov–Gabor polynomials that permit additions and multiplications. GMDH used a deep feedforward multilayer perceptron with eight layers.[29] It is a supervised learning network that grows layer by layer, where each layer is trained by regression analysis. Useless items are detected using a validation set and pruned through regularization. The size and depth of the resulting network depend on the task.[290]

Convolutional neural networks

Modèle:Main

A convolutional neural network (CNN) is a class of deep, feed-forward networks, composed of one or more convolutional layers with fully connected layers (matching those in typical artificial neural networks) on top. It uses tied weights and pooling layers. In particular, max-pooling[227] is often structured via Fukushima's convolutional architecture.[30] This architecture allows CNNs to take advantage of the 2D structure of input data.

CNNs are suitable for processing visual and other two-dimensional data.[35][68] They have shown superior results in both image and speech applications. They can be trained with standard backpropagation. CNNs are easier to train than other regular, deep, feed-forward neural networks and have many fewer parameters to estimate.[291] Examples of applications in computer vision include DeepDream[292] and robot navigation.[293]
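
A minimal sketch of the two core CNN operations, valid 2D convolution with a trainable filter and max-pooling; the example filter is hand-set purely for illustration:

  import numpy as np

  def conv2d(image, kernel):
      # valid convolution of a 2D image with a filter (a trainable weight vector)
      kh, kw = kernel.shape
      H, W = image.shape
      out = np.zeros((H - kh + 1, W - kw + 1))
      for i in range(out.shape[0]):
          for j in range(out.shape[1]):
              out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
      return out

  def maxpool(feature_map, size=2):
      # keep the strongest activation in each size x size window
      H, W = feature_map.shape
      return np.array([[feature_map[i:i+size, j:j+size].max()
                        for j in range(0, W - size + 1, size)]
                       for i in range(0, H - size + 1, size)])

  img = np.random.rand(6, 6)
  edge = np.array([[1.0, -1.0], [1.0, -1.0]])   # a hand-set example filter
  print(maxpool(conv2d(img, edge)).shape)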

A recent development has been the capsule neural network (CapsNet), the idea behind which is to add structures called capsules to a CNN and to reuse output from several of those capsules to form more stable (with respect to various perturbations) representations for higher-order capsules.[294]

Long short-term memory

Modèle:Main

Long short-term memory (LSTM) networks are RNNs that avoid the vanishing gradient problem.[295] LSTM is normally augmented by recurrent gates called forget gates.[109] LSTM networks prevent backpropagated errors from vanishing or exploding.[229] Instead, errors can flow backwards through unlimited numbers of virtual layers in space-unfolded LSTM. That is, LSTM can learn "very deep learning" tasks[221] that require memories of events that happened thousands or even millions of discrete time steps earlier. Problem-specific LSTM-like topologies can be evolved.[296] LSTM can handle long delays and signals that have a mix of low- and high-frequency components.
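
A sketch of one step of a standard LSTM cell may make the gating concrete; the weight shapes and initialization are illustrative, and the additive cell-state update is what lets errors flow back without vanishing:

  import numpy as np

  def lstm_step(x, h_prev, c_prev, W, U, b):
      # one step of an LSTM cell with input (i), forget (f) and output (o) gates
      sig = lambda z: 1.0 / (1.0 + np.exp(-z))
      z = W @ x + U @ h_prev + b                 # stacked pre-activations
      n = len(c_prev)
      i, f, o = sig(z[:n]), sig(z[n:2*n]), sig(z[2*n:3*n])
      g = np.tanh(z[3*n:])                       # candidate cell state
      c = f * c_prev + i * g                     # forget gate guards old memory
      h = o * np.tanh(c)
      return h, c

  n, m = 3, 2                                    # hidden size, input size
  rng = np.random.default_rng(0)
  W, U, b = rng.normal(size=(4*n, m)), rng.normal(size=(4*n, n)), np.zeros(4*n)
  h, c = lstm_step(rng.normal(size=m), np.zeros(n), np.zeros(n), W, U, b)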

Stacks of LSTM RNNs[297] trained by Connectionist Temporal Classification (CTC)[166] can find an RNN weight matrix that maximizes the probability of the label sequences in a training set, given the corresponding input sequences. CTC achieves both alignment and recognition.

In 2003, LSTM started to become competitive with traditional speech recognizers.[52] In 2007, the combination with CTC achieved the first good results on speech data.[54] In 2009, a CTC-trained LSTM was the first RNN to win pattern recognition contests, when it won several competitions in connected handwriting recognition.[221][242] In 2014, Baidu used CTC-trained RNNs to break the Switchboard Hub5'00 speech recognition benchmark, without traditional speech processing methods.[298] LSTM also improved large-vocabulary speech recognition,[63][64] text-to-speech synthesis[299] (including for Google Android)[257][65] and photo-real talking heads.[300] In 2015, Google's speech recognition experienced a 49% improvement through CTC-trained LSTM.[55]

LSTM became popular in natural language processing. Unlike previous models based on HMMs and similar concepts, LSTM can learn to recognize context-sensitive languages.[104] LSTM improved machine translation,[301][105] language modeling[106] and multilingual language processing.[107] LSTM combined with CNNs improved automatic image captioning.[302]

Deep reservoir computing

Modèle:Main

Deep reservoir computing and deep echo state networks (deepESNs)[303][304] provide a framework for efficiently trained models for hierarchical processing of temporal data, while enabling investigation of the inherent role of layered composition in RNNs.

Deep belief networks

Modèle:Main

Fichier:Restricted Boltzmann machine.svg
A restricted Boltzmann machine (RBM) with fully connected visible and hidden units. Note there are no hidden-hidden or visible-visible connections.

A deep belief network (DBN) is a probabilistic, generative model made up of multiple layers of hidden units. It can be considered a composition of simple learning modules that make up each layer.[14]

A DBN can be used to generatively pre-train a DNN by using the learned DBN weights as the initial DNN weights. Backpropagation or other discriminative algorithms can then tune these weights. This is particularly helpful when training data are limited, because poorly initialized weights can significantly hinder model performance. These pre-trained weights land in a region of the weight space that is closer to the optimal weights than randomly chosen initial weights. This allows for both improved modeling and faster convergence of the fine-tuning phase.[305]

Large memory storage and retrieval neural networks

Large memory storage and retrieval neural networks (LAMSTAR)[306][307] are fast deep learning neural networks of many layers that can use many filters simultaneously. These filters may be nonlinear, stochastic, logic, non-stationary, or even non-analytical. They are biologically motivated and learn continuously.

A LAMSTAR neural network may serve as a dynamic neural network in spatial or time domains or both. Its speed is provided by Hebbian link-weights[308] that integrate the various and usually different filters (preprocessing functions) into its many layers and dynamically rank the significance of the various layers and functions relative to a given learning task. This grossly imitates biological learning, which integrates various preprocessors (cochlea, retina, etc.) and cortexes (auditory, visual, etc.) and their various regions. Its deep learning capability is further enhanced by using inhibition, correlation and its ability to cope with incomplete data, or "lost" neurons or layers even amidst a task. It is fully transparent due to its link weights. The link-weights allow dynamic determination of innovation and redundancy, and facilitate the ranking of layers, of filters or of individual neurons relative to a task.

LAMSTAR has been applied to many domains, including medical[309][90][310] and financial predictions,[311] adaptive filtering of noisy speech in unknown noise,[312] still-image recognition,[313] video image recognition,[314] software security[315] and adaptive control of non-linear systems.[316] LAMSTAR had a much faster learning speed and somewhat lower error rate than a CNN based on ReLU-function filters and max pooling, in 20 comparative studies.[317]

These applications demonstrate delving into aspects of the data that are hidden from shallow learning networks and the human senses, such as in the cases of predicting onset of sleep apnea events,[90] of an electrocardiogram of a fetus as recorded from skin-surface electrodes placed on the mother's abdomen early in pregnancy,[310] of financial prediction[306] or in blind filtering of noisy speech.[312]

LAMSTAR was proposed in 1996 (Modèle:US Patent) and was further developed by Graupe and Kordylewski from 1997–2002.[318][319][320] A modified version, known as LAMSTAR 2, was developed by Schneider and Graupe in 2008.[321][322]

Stacked (denoising) autoencoders

The autoencoder idea is motivated by the concept of a good representation. For example, for a classifier, a good representation can be defined as one that yields a better-performing classifier.

An encoder is a deterministic mapping <math>f_\theta</math> that transforms an input vector x into hidden representation y, where <math>\theta = \{\boldsymbol{W}, b\}</math>, <math>\boldsymbol{W}</math> is the weight matrix and b is an offset vector (bias). A decoder maps the hidden representation y back to the reconstructed input z via <math>g_\theta</math>. Autoencoding consists of comparing the reconstructed input to the original and minimizing the reconstruction error, so that the reconstruction is as close as possible to the original.

In stacked denoising autoencoders, the partially corrupted output is cleaned (denoised). This idea was introduced in 2010 by Vincent et al.[323] with a specific approach to good representation: a good representation is one that can be obtained robustly from a corrupted input and that will be useful for recovering the corresponding clean input. Implicit in this definition are the following ideas:

  • The higher level representations are relatively stable and robust to input corruption;
  • It is necessary to extract features that are useful for representation of the input distribution.

The algorithm starts with a stochastic mapping of <math>\boldsymbol{x}</math> to <math>\tilde{\boldsymbol{x}}</math> through <math>q_D(\tilde{\boldsymbol{x}}|\boldsymbol{x})</math>; this is the corrupting step. The corrupted input <math>\tilde{\boldsymbol{x}}</math> then passes through a basic autoencoder process and is mapped to a hidden representation <math>\boldsymbol{y} = f_\theta(\tilde{\boldsymbol{x}}) = s(\boldsymbol{W}\tilde{\boldsymbol{x}}+b)</math>. From this hidden representation, we can reconstruct <math>\boldsymbol{z} = g_\theta(\boldsymbol{y})</math>. In the last stage, a minimization algorithm runs in order to make z as close as possible to the uncorrupted input <math>\boldsymbol{x}</math>. The reconstruction error <math>L_H(\boldsymbol{x},\boldsymbol{z})</math> might be either the cross-entropy loss with an affine-sigmoid decoder, or the squared error loss with an affine decoder.[323]
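
A sketch of one denoising pass, assuming masking noise for <math>q_D</math> and tied decoder weights <math>\boldsymbol{W}^T</math> (both common but not required choices):

  import numpy as np

  def dae_step(x, W, b, b_prime, rng, corruption=0.3):
      # corrupt, encode, decode; returns the reconstruction z and L_H(x, z)
      sig = lambda z: 1.0 / (1.0 + np.exp(-z))
      x_tilde = x * (rng.random(x.shape) > corruption)   # q_D: randomly zero inputs
      y = sig(W @ x_tilde + b)                           # hidden representation
      z = sig(W.T @ y + b_prime)                         # affine-sigmoid decoder
      return z, np.sum((x - z) ** 2)                     # squared-error loss

  rng = np.random.default_rng(0)
  W = rng.normal(0, 0.1, (5, 8))                         # 8 inputs, 5 hidden units
  z, loss = dae_step(rng.random(8), W, np.zeros(5), np.zeros(8), rng)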

To build a deep architecture, autoencoders are stacked.[324] Once the encoding function <math>f_\theta</math> of the first denoising autoencoder is learned and used to uncorrupt the input (corrupted input), the second level can be trained.[323]

Once the stacked autoencoder is trained, its output can be used as the input to a supervised learning algorithm such as a support vector machine classifier or multi-class logistic regression.[323]

Deep stacking networks

A deep stacking network (DSN)[325] (deep convex network) is based on a hierarchy of blocks of simplified neural network modules. It was introduced in 2011 by Deng and Dong.[326] It formulates the learning as a convex optimization problem with a closed-form solution, emphasizing the mechanism's similarity to stacked generalization.[327] Each DSN block is a simple module that is easy to train by itself in a supervised fashion, without backpropagation through the entire network.[328]

Each block consists of a simplified multi-layer perceptron (MLP) with a single hidden layer. The hidden layer h has logistic sigmoidal units, and the output layer has linear units. Connections between these layers are represented by weight matrix U; input-to-hidden-layer connections have weight matrix W. Target vectors t form the columns of matrix T, and the input data vectors x form the columns of matrix X. The matrix of hidden units is <math>\boldsymbol{H} = \sigma(\boldsymbol{W}^T\boldsymbol{X})</math>. Modules are trained in order, so lower-layer weights W are known at each stage. The function performs the element-wise logistic sigmoid operation. Each block estimates the same final label class y, and its estimate is concatenated with original input X to form the expanded input for the next block. Thus, the input to the first block contains the original data only, while downstream blocks' input adds the output of preceding blocks. Then learning the upper-layer weight matrix U given other weights in the network can be formulated as a convex optimization problem:

<math>\min_{U^T} f = ||\boldsymbol{U}^T \boldsymbol{H} - \boldsymbol{T}||^2_F,</math>

which has a closed-form solution.
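
In code, the closed-form solution is an ordinary least-squares problem; the following sketch adds a small ridge term for numerical stability, which is an implementation choice rather than part of the formulation:

  import numpy as np

  def dsn_upper_weights(H, T, ridge=1e-6):
      # minimize ||U^T H - T||_F^2, giving U = (H H^T)^(-1) H T^T
      d = H.shape[0]
      return np.linalg.solve(H @ H.T + ridge * np.eye(d), H @ T.T)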

Unlike other deep architectures, such as DBNs, the goal is not to discover the transformed feature representation. The structure of the hierarchy of this kind of architecture makes parallel learning straightforward, as a batch-mode optimization problem. In purely discriminative tasks, DSNs perform better than conventional DBNs.[325]

Tensor deep stacking networks

This architecture is a DSN extension. It offers two important improvements: it uses higher-order information from covariance statistics, and it transforms the non-convex problem of a lower-layer to a convex sub-problem of an upper-layer.[329] TDSNs use covariance statistics in a bilinear mapping from each of two distinct sets of hidden units in the same layer to predictions, via a third-order tensor.

While parallelization and scalability are not considered seriously in conventional DNNs,[330][331][332] all learning for DSNs and TDSNs is done in batch mode, to allow parallelization.[326][325] Parallelization allows scaling the design to larger (deeper) architectures and data sets.

The basic architecture is suitable for diverse tasks such as classification and regression.

Spike-and-slab RBMs

The need for deep learning with real-valued inputs, as in Gaussian restricted Boltzmann machines, led to the spike-and-slab RBM (ssRBM), which models continuous-valued inputs with strictly binary latent variables.[333] Similar to basic RBMs and its variants, a spike-and-slab RBM is a bipartite graph, while like GRBMs, the visible units (input) are real-valued. The difference is in the hidden layer, where each hidden unit has a binary spike variable and a real-valued slab variable. A spike is a discrete probability mass at zero, while a slab is a density over continuous domain;[334] their mixture forms a prior.[335]
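
The spike-and-slab mixture itself is easy to visualize by sampling from it; the following sketch illustrates only the prior, not the full ssRBM energy function:

  import numpy as np

  def sample_spike_and_slab(n, p_spike=0.5, slab_std=1.0, seed=0):
      # each draw is exactly zero with probability p_spike (the discrete mass),
      # otherwise a real value from the continuous slab density
      rng = np.random.default_rng(seed)
      spike = rng.random(n) < p_spike
      slab = rng.normal(0.0, slab_std, n)
      return np.where(spike, 0.0, slab)

  print(sample_spike_and_slab(10))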

An extension of ssRBM called µ-ssRBM provides extra modeling capacity using additional terms in the energy function. One of these terms enables the model to form a conditional distribution of the spike variables by marginalizing out the slab variables given an observation.

Compound hierarchical-deep models

Compound hierarchical-deep models compose deep networks with non-parametric Bayesian models. Features can be learned using deep architectures such as DBNs,[233] DBMs,[336] deep autoencoders,[337] convolutional variants,[338][339] ssRBMs,[334] deep coding networks,[340] DBNs with sparse feature learning,[341] RNNs,[342] conditional DBNs,[343] and denoising autoencoders.[344] This provides a better representation, allowing faster learning and more accurate classification with high-dimensional data. However, these architectures are poor at learning novel classes with few examples, because all network units are involved in representing the input (a distributed representation) and must be adjusted together (a high degree of freedom). Limiting the degree of freedom reduces the number of parameters to learn, facilitating learning of new classes from few examples. Hierarchical Bayesian (HB) models allow learning from few examples, for example[345][346][347][348][349] in computer vision, statistics and cognitive science.

Compound HD architectures aim to integrate characteristics of both HB and deep networks. The compound HDP-DBM architecture is a hierarchical Dirichlet process (HDP) as a hierarchical model, incorporated with DBM architecture. It is a full generative model, generalized from abstract concepts flowing through the layers of the model, which is able to synthesize new examples in novel classes that look "reasonably" natural. All the levels are learned jointly by maximizing a joint log-probability score.[350]

In a DBM with three hidden layers, the probability of a visible input <math>\boldsymbol{\nu}</math> is:

<math>p(\boldsymbol{\nu}, \psi) = \frac{1}{Z}\sum_h e^{\sum_{ij}W_{ij}^{(1)}\nu_i h_j^1 + \sum_{jl}W_{jl}^{(2)}h_j^{1}h_l^{2}+\sum_{lm}W_{lm}^{(3)}h_l^{2}h_m^{3}},</math>

where <math>\boldsymbol{h} = \{\boldsymbol{h}^{(1)}, \boldsymbol{h}^{(2)}, \boldsymbol{h}^{(3)} \}</math> is the set of hidden units, and <math>\psi = \{\boldsymbol{W}^{(1)}, \boldsymbol{W}^{(2)}, \boldsymbol{W}^{(3)} \} </math> are the model parameters, representing visible-hidden and hidden-hidden symmetric interaction terms.

A learned DBM model is an undirected model that defines the joint distribution <math>P(\nu, h^1, h^2, h^3)</math>. One way to express what has been learned is the conditional model <math>P(\nu, h^1, h^2|h^3)</math> and a prior term <math>P(h^3)</math>.

Here <math>P(\nu, h^1, h^2|h^3)</math> represents a conditional DBM model, which can be viewed as a two-layer DBM but with bias terms given by the states of <math>h^3</math>:

<math>P(\nu, h^1, h^2|h^3) = \frac{1}{Z(\psi, h^3)}e^{\sum_{ij}W_{ij}^{(1)}\nu_i h_j^1 + \sum_{jl}W_{jl}^{(2)}h_j^{1}h_l^{2}+\sum_{lm}W_{lm}^{(3)}h_l^{2}h_m^{3}}.</math>

Deep predictive coding networks

A deep predictive coding network (DPCN) is a predictive coding scheme that uses top-down information to empirically adjust the priors needed for a bottom-up inference procedure by means of a deep, locally connected, generative model. This works by extracting sparse features from time-varying observations using a linear dynamical model. Then, a pooling strategy is used to learn invariant feature representations. These units compose to form a deep architecture and are trained by greedy layer-wise unsupervised learning. The layers constitute a kind of Markov chain such that the states at any layer depend only on the preceding and succeeding layers.

DPCNs predict the representation of a layer using a top-down approach, combining the information in the layer above with temporal dependencies on previous states.[351]

DPCNs can be extended to form a convolutional network.[351]

Networks with separate memory structures

Integrating external memory with artificial neural networks dates to early research in distributed representations[352] and Kohonen's self-organizing maps. For example, in sparse distributed memory or hierarchical temporal memory, the patterns encoded by neural networks are used as addresses for content-addressable memory, with "neurons" essentially serving as address encoders and decoders. However, the early controllers of such memories were not differentiable.

LSTM-related differentiable memory structures

Apart from long short-term memory (LSTM), other approaches also added differentiable memory to recurrent functions. For example:

  • Differentiable push and pop actions for alternative memory networks called neural stack machines[353][354]
  • Memory networks where the control network's external differentiable storage is in the fast weights of another network[355]
  • LSTM forget gates[356]
  • Self-referential RNNs with special output units for addressing and rapidly manipulating the RNN's own weights in differentiable fashion (internal storage)[357][358]
  • Learning to transduce with unbounded memory[359]

Neural Turing machines

Modèle:Main

Neural Turing machines[360] couple LSTM networks to external memory resources, with which they can interact by attentional processes. The combined system is analogous to a Turing machine but is differentiable end-to-end, allowing it to be efficiently trained by gradient descent. Preliminary results demonstrate that neural Turing machines can infer simple algorithms such as copying, sorting and associative recall from input and output examples.

Differentiable neural computers (DNC) are an NTM extension. They outperformed neural Turing machines, long short-term memory systems and memory networks on sequence-processing tasks.[361][362][363][364][365]

Semantic hashing

Approaches that represent previous experiences directly and use a similar experience to form a local model are often called nearest neighbour or k-nearest neighbors methods.[366] Deep learning is useful in semantic hashing,[367] where a deep graphical model is trained on the word-count vectors[368] obtained from a large set of documents. Documents are mapped to memory addresses in such a way that semantically similar documents are located at nearby addresses. Documents similar to a query document can then be found by accessing all the addresses that differ by only a few bits from the address of the query document. Unlike sparse distributed memory, which operates on 1000-bit addresses, semantic hashing works on the 32- or 64-bit addresses found in a conventional computer architecture.
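A minimal sketch of the lookup step. The binary codes here are random stand-ins; in a real system they would be produced by the trained model (names and sizes are illustrative assumptions):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Stand-in 32-bit codes; a trained model would supply these.
n_docs, n_bits = 10_000, 32
codes = rng.integers(0, 2, size=(n_docs, n_bits), dtype=np.uint8)

def neighbours(query_code, codes, max_hamming=2):
    """Indices of documents whose address differs from the query
    by at most max_hamming bits."""
    dist = np.count_nonzero(codes != query_code, axis=1)
    return np.flatnonzero(dist <= max_hamming)

print(neighbours(codes[0], codes)[:10])
</syntaxhighlight>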

Memory networks

Memory networks[369][370] are another extension to neural networks incorporating long-term memory. The long-term memory can be read and written to, with the goal of using it for prediction. These models have been applied in the context of question answering (QA) where the long-term memory effectively acts as a (dynamic) knowledge base and the output is a textual response.[371] A team of electrical and computer engineers from UCLA Samueli School of Engineering has created a physical artificial neural network that can analyze large volumes of data and identify objects at the actual speed of light.[372]

Pointer networks

Deep neural networks can potentially be improved by deepening and parameter reduction, while maintaining trainability. While training extremely deep (e.g., 1 million layers) neural networks might not be practical, CPU-like architectures such as pointer networks[373] and neural random-access machines[374] overcome this limitation by using external random-access memory and other components that typically belong to a computer architecture, such as registers, an ALU and pointers. Such systems operate on probability distribution vectors stored in memory cells and registers. Thus, the model is fully differentiable and trains end-to-end. The key characteristic of these models is that their depth, the size of their short-term memory, and the number of parameters can be altered independently, unlike models such as the LSTM, whose number of parameters grows quadratically with memory size.

Encoder–decoder networks

Encoder–decoder frameworks are based on neural networks that map highly structured input to highly structured output. The approach arose in the context of machine translation,[375][376][377] where the input and output are written sentences in two natural languages. In that work, an LSTM RNN or CNN was used as an encoder to summarize a source sentence, and the summary was decoded by a conditional RNN language model to produce the translation.[378] These systems share building blocks: gated RNNs, CNNs, and trained attention mechanisms.

Multilayer kernel machine

Multilayer kernel machines (MKM) are a way of learning highly nonlinear functions by iterative application of weakly nonlinear kernels. They use kernel principal component analysis (KPCA)[379] as a method for the unsupervised greedy layer-wise pre-training step of deep learning.[380]

Layer <math>l+1</math> learns the representation of the previous layer <math>l</math>, extracting the <math>n_l</math> principal components (PCs) of the projection of layer <math>l</math>'s output in the feature domain induced by the kernel. To reduce the dimensionality of the updated representation in each layer, a supervised strategy selects the most informative features among those extracted by KPCA. The process is (a sketch follows the list):

  • rank the <math>n_l</math> features according to their mutual information with the class labels;
  • for different values of K and <math>m_l \in\{1, \ldots, n_l\}</math>, compute the classification error rate of a K-nearest neighbor (K-NN) classifier using only the <math>m_l</math> most informative features on a validation set;
  • the value of <math>m_l</math> with which the classifier has reached the lowest error rate determines the number of features to retain.
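A sketch of one such layer, under stated assumptions: scikit-learn's KernelPCA, mutual_info_classif and KNeighborsClassifier stand in for the components described above, the digits dataset is used purely for illustration, and <math>n_l = 30</math> is an arbitrary choice.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import KernelPCA
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# One MKM layer: KPCA features, ranked by mutual information with the labels.
n_l = 30
kpca = KernelPCA(n_components=n_l, kernel="rbf").fit(X_train)
F_train, F_val = kpca.transform(X_train), kpca.transform(X_val)
order = np.argsort(mutual_info_classif(F_train, y_train, random_state=0))[::-1]

# Pick m_l (and K) by the validation error of a K-NN classifier.
best = None
for K in (1, 3, 5):
    for m_l in range(1, n_l + 1):
        knn = KNeighborsClassifier(n_neighbors=K)
        knn.fit(F_train[:, order[:m_l]], y_train)
        err = 1 - knn.score(F_val[:, order[:m_l]], y_val)
        if best is None or err < best[0]:
            best = (err, K, m_l)
print("validation error, K, m_l:", best)
</syntaxhighlight>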

Some drawbacks accompany the KPCA method as the building cells of an MKM.

A more straightforward way to use kernel machines for deep learning was developed for spoken language understanding.[381] The main idea is to use a kernel machine to approximate a shallow neural net with an infinite number of hidden units, then use stacking to splice the output of the kernel machine and the raw input in building the next, higher level of the kernel machine. The number of levels in the deep convex network is a hyper-parameter of the overall system, to be determined by cross validation.

Neural architecture search

Modèle:Main Neural architecture search (NAS) uses machine learning to automate the design of artificial neural networks. Various approaches to NAS have designed networks that compare well with hand-designed systems. The basic search algorithm is to propose a candidate model, evaluate it against a dataset and use the results as feedback to teach the NAS network.[382]

Use

Using artificial neural networks requires an understanding of their characteristics.

  • Choice of model: This depends on the data representation and the application. Overly complex models slow learning.
  • Learning algorithm: Numerous trade-offs exist between learning algorithms. Almost any algorithm will work well with the correct hyperparameters for training on a particular data set. However, selecting and tuning an algorithm for training on unseen data requires significant experimentation.
  • Robustness: If the model, cost function and learning algorithm are selected appropriately, the resulting ANN can become robust.

ANN capabilities fall within the following broad categories:Modèle:Citation needed

  • function approximation, or regression analysis, including time series prediction and modeling;
  • classification, including pattern and sequence recognition and novelty detection;
  • data processing, including filtering, clustering, blind source separation and compression;
  • robotics, including the direction of manipulators and prostheses;
  • control, including computer numerical control.

Applications

Because of their ability to reproduce and model nonlinear processes, artificial neural networks have found many applications in a wide range of disciplines.

Application areas include system identification and control (vehicle control, trajectory prediction,[383] process control, natural resource management), quantum chemistry,[384] general game playing,[385] pattern recognition (radar systems, face identification, signal classification,[386] 3D reconstruction,[387] object recognition and more), sequence recognition (gesture, speech, handwritten and printed text recognition), medical diagnosis, finance[388] (e.g. automated trading systems), data mining, visualization, machine translation, social network filtering[389] and e-mail spam filtering.

Artificial neural networks have been used to diagnose cancers, including lung cancer,[390] prostate cancer, colorectal cancer[391] and to distinguish highly invasive cancer cell lines from less invasive lines using only cell shape information.[392][393]

Artificial neural networks have been used to accelerate reliability analysis of infrastructures subject to natural disasters[394][395] and to predict foundation settlements.[396]

Artificial neural networks have also been used for building black-box models in geoscience: hydrology,[397][398] ocean modelling and coastal engineering,[399][400] and geomorphology.[401]

Artificial neural networks have also been employed with some success in cybersecurity, with the objective of discriminating between legitimate and malicious activities. For example, machine learning has been used for classifying Android malware,[402] for identifying domains belonging to threat actors[403] and for detecting URLs posing a security risk.[404] Research is also being carried out on ANN systems designed for penetration testing,[405] and for detecting botnets,[406] credit card fraud,[407] network intrusions and, more generally, potentially infected machines.

Types of models

Many types of models are used, defined at different levels of abstraction and modeling different aspects of neural systems. They range from models of the short-term behavior of individual neurons,[408] through models of how the dynamics of neural circuitry arise from interactions between individual neurons, to models of how behavior can arise from abstract neural modules that represent complete subsystems. These include models of the long- and short-term plasticity of neural systems and their relation to learning and memory, from the individual neuron to the system level.

Theoretical properties

Computational power

The multilayer perceptron is a universal function approximator, as proven by the universal approximation theorem. However, the proof is not constructive regarding the number of neurons required, the network topology, the weights and the learning parameters.

A specific recurrent architecture with rational-valued weights (as opposed to full-precision real-valued weights) has the full power of a universal Turing machine,[409] using a finite number of neurons and standard linear connections. Further, the use of irrational values for weights results in a machine with super-Turing power.[410]

Capacity

Models' "capacity" property roughly corresponds to their ability to model any given function. It is related to the amount of information that can be stored in the network and to the notion of complexity.Modèle:Citation needed

Convergence

Models may not consistently converge on a single solution, firstly because many local minima may exist, depending on the cost function and the model. Secondly, the optimization method used might not be guaranteed to converge when it begins far from any local minimum. Thirdly, for sufficiently large data or parameters, some methods become impractical. However, for the CMAC neural network a recursive least squares training algorithm was introduced that is guaranteed to converge in one step.[117]

Generalization and statistics

Applications whose goal is to create a system that generalizes well to unseen examples face the possibility of over-training. This arises in convoluted or over-specified systems when the capacity of the network significantly exceeds the needed free parameters. Two approaches address over-training. The first is to use cross-validation and similar techniques to check for the presence of over-training and to select hyperparameters that minimize the generalization error. The second is to use some form of regularization. This concept emerges in a probabilistic (Bayesian) framework, where regularization can be performed by selecting a larger prior probability over simpler models, but also in statistical learning theory, where the goal is to minimize two quantities: the 'empirical risk' and the 'structural risk', which roughly correspond to the error over the training set and the predicted error on unseen data due to overfitting.

Fichier:Synapse deployment.jpg
Confidence analysis of a neural network

Supervised neural networks that use a mean squared error (MSE) cost function can use formal statistical methods to determine the confidence of the trained model. The MSE on a validation set can be used as an estimate for variance. This value can then be used to calculate the confidence interval of the output of the network, assuming a normal distribution. A confidence analysis made this way is statistically valid as long as the output probability distribution stays the same and the network is not modified.

By assigning a softmax activation function, a generalization of the logistic function, on the output layer of the neural network (or a softmax component in a component-based neural network) for categorical target variables, the outputs can be interpreted as posterior probabilities. This is very useful in classification as it gives a certainty measure on classifications.

The softmax activation function is:

<math>y_i=\frac{e^{x_i}}{\sum_{j=1}^c e^{x_j}}</math>
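In code, a numerically stable softmax subtracts the maximum score before exponentiating; the shift cancels between numerator and denominator, so the result is unchanged. A minimal NumPy version:

<syntaxhighlight lang="python">
import numpy as np

def softmax(x):
    """Numerically stable softmax: subtracting max(x) leaves the result
    unchanged because the shift cancels in numerator and denominator."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))  # approx. [0.659 0.242 0.099]; sums to 1
</syntaxhighlight>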


Criticism

Training issues

A common criticism of neural networks, particularly in robotics, is that they require too much training for real-world operation.Modèle:Citation needed Potential solutions include randomly shuffling training examples, using a numerical optimization algorithm that does not take overly large steps when changing the network connections following an example, and grouping examples into so-called mini-batches. Improving training efficiency and convergence has always been an ongoing research area for neural networks. For example, with the introduction of a recursive least squares algorithm for the CMAC neural network, the training process converges in a single step.[117]

Theoretical issues

A fundamental objection is that ANNs do not reflect how real neurons function. Backpropagation is a critical part of most artificial neural networks, although no such mechanism exists in biological neural networks.[411] How information is coded by real neurons is not known. Sensory neurons fire action potentials more frequently with sensor activation, and muscle cells pull more strongly when their associated motor neurons receive action potentials more frequently.[412] Other than the case of relaying information from a sensory neuron to a motor neuron, almost nothing of the principles of how information is handled by biological neural networks is known. This is a subject of active research in neural coding.

The motivation behind artificial neural networks is not necessarily to strictly replicate neural function, but to use biological neural networks as an inspiration. A central claim of artificial neural networks is therefore that they embody some new and powerful general principle for processing information. Unfortunately, these general principles are ill-defined; it is often claimed that they are emergent from the network itself. This allows simple statistical association (the basic function of artificial neural networks) to be described as learning or recognition. Alexander Dewdney commented that, as a result, artificial neural networks have a "something-for-nothing quality, one that imparts a peculiar aura of laziness and a distinct lack of curiosity about just how good these computing systems are. No human hand (or mind) intervenes; solutions are found as if by magic; and no one, it seems, has learned anything".[413]

Biological brains use both shallow and deep circuits as reported by brain anatomy,[414] displaying a wide variety of invariance. Weng[415] argued that the brain self-wires largely according to signal statistics and therefore, a serial cascade cannot catch all major statistical dependencies.

Hardware issues

Large and effective neural networks require considerable computing resources.[51] While the brain has hardware tailored to the task of processing signals through a graph of neurons, simulating even a simplified neuron on a von Neumann architecture may compel a neural network designer to fill many millions of database rows for its connections, which can consume vast amounts of memory and storage. Furthermore, the designer often needs to transmit signals through many of these connections and their associated neurons, which must often be matched with enormous CPU processing power and time.

Schmidhuber notes that the resurgence of neural networks in the twenty-first century is largely attributable to advances in hardware: from 1991 to 2015, computing power, especially as delivered by general-purpose computing on GPUs (GPGPU), increased around a million-fold, making the standard backpropagation algorithm feasible for training networks that are several layers deeper than before.[416] The use of accelerators such as FPGAs and GPUs can reduce training times from months to days.[417]

Neuromorphic engineering addresses the hardware difficulty directly, by constructing non-von-Neumann chips to directly implement neural networks in circuitry. Another chip optimized for neural network processing is called a Tensor Processing Unit, or TPU.[418]

Practical counterexamples to criticisms

Arguments against Dewdney's position are that neural networks have been successfully used to solve many complex and diverse tasks, ranging from autonomously flying aircraft[419] to detecting credit card fraud to mastering the game of Go.

Technology writer Roger Bridgman commented:

Modèle:Quote

Although it is true that analyzing what has been learned by an artificial neural network is difficult, it is much easier to do so than to analyze what has been learned by a biological neural network. Furthermore, researchers involved in exploring learning algorithms for neural networks are gradually uncovering general principles that allow a learning machine to be successful, for example local versus non-local learning and shallow versus deep architecture.[420]

Hybrid approaches

Advocates of hybrid models (combining neural networks and symbolic approaches) claim that such a mixture can better capture the mechanisms of the human mind.[421][422]

Types

Modèle:Main

Artificial neural networks have many variations. The simplest, static types have one or more static components, including number of units, number of layers, unit weights and topology. Dynamic types allow one or more of these to change during the learning process. The latter are much more complicated, but can shorten learning periods and produce better results. Some types allow/require learning to be "supervised" by the operator, while others operate independently. Some types operate purely in hardware, while others are purely software and run on general purpose computers.

Gallery

See also

Modèle:Columns-list

References

Modèle:Reflist

Bibliography

Modèle:Div col

Modèle:Div col end

Modèle:CPU technologies




Modèle:Voir homonymes Modèle:À sourcer An artificial neural network, or artificial neuronal network, is a system whose design was originally schematically inspired by the functioning of biological neurons, and which has subsequently moved closer to statistical methods.[423]

Neural networks are generally optimized by probabilistic learning methods, in particular Bayesian ones. They belong, on the one hand, to the family of statistical applications, which they enrich with a set of paradigms[424] allowing the creation of fast classifications (Kohonen networks in particular), and, on the other hand, to the family of artificial-intelligence methods, to which they supply a perceptual mechanism independent of the implementer's own ideas, as well as input information for formal logical reasoning (see Deep Learning).

In the modelling of biological circuits, they make it possible to test functional hypotheses from neurophysiology, or the consequences of those hypotheses, in order to compare them with reality.

History

Fichier:Neural network.svg
Simplified view of an artificial neural network

Neural networks are built on a biological paradigm, that of the formal neuron (just as genetic algorithms are built on natural selection). Such biological metaphors became common with the ideas of cybernetics and biocybernetics. In Yann Le Cun's phrase, such a model no more claims to describe the brain than an airplane wing copies that of a bird.[425] In particular, the role of glial cells is not simulated for the moment (2010).

Formal neuron

The neurophysiologists Warren McCulloch and Walter Pitts published the first work on neural networks in 1943, introducing a simplified model of the biological neuron commonly known as the formal neuron; they later co-authored the founding article What the frog's eye tells the frog's brain.[426] They showed that simple networks of formal neurons can, in theory, carry out complex logical, arithmetic and symbolic functions.

The formal neuron is conceived as an automaton endowed with a transfer function that turns its inputs into an output according to precise rules. For example, a neuron sums its inputs, compares the resulting sum to a threshold value, and responds by emitting a signal if this sum is greater than or equal to the threshold (an ultra-simplified model of the functioning of a biological neuron). These neurons are then connected into networks whose connection topology varies: feedforward networks, recurrent networks, and so on. Finally, the efficiency of signal transmission from one neuron to the next can vary: one speaks of "synaptic weight", and these weights can be modulated by learning rules (which mimics the synaptic plasticity of biological networks).

One function of formal neural networks, like the living model, is to perform classifications quickly and learn to improve them. Unlike traditional computing approaches, a program does not have to be built step by step from an understanding of the task. The important parameters of this model are the synaptic coefficients and the threshold of each neuron, and the way they are adjusted: they determine how the network evolves as a function of its inputs. A mechanism must be chosen to compute them and, if possible, make them converge towards values that ensure a classification as close as possible to the optimum. This is called the learning phase of the network. In a formal neural network model, learning therefore amounts to determining the synaptic coefficients best suited to classifying the examples presented.

Perceptron

The work of McCulloch and Pitts gave no indication of a method for adapting the synaptic coefficients. This question, at the heart of thinking about learning, saw a first answer in 1949 thanks to the work of the Canadian psychologist Donald Hebb, described in his book The Organization of Behavior. Hebb proposed a simple rule for modifying the values of the synaptic coefficients according to the activity of the units they connect. This rule, now known as "Hebb's rule", is present almost everywhere in current models, even the most sophisticated.

Fichier:RecurrentLayerNeuralNetwork.png
Neural network with feedback

From that article, the idea spread over time and took root in the mind of Frank Rosenblatt in 1957 with the perceptron model. It was the first artificial system capable of learning from experience, even when its instructor makes some mistakes (in which respect it differs clearly from a formal logical learning system).

In 1969, a serious blow was dealt to the scientific community working on neural networks: Marvin Lee Minsky and Seymour Papert published a book highlighting some theoretical limitations of the perceptron, and more generally of linear classifiers, notably the impossibility of handling nonlinear problems or connectedness. They implicitly extended these limitations to all artificial neural network models. Appearing to be at a dead end, research on neural networks lost much of its public funding, and industry turned away from it as well. The funds devoted to artificial intelligence were redirected towards formal logic instead.[427] However, the solid adaptive qualities of some neural networks (e.g. Adaline), allowing them to model evolving phenomena in an evolving way, led them to be incorporated, in more or less explicit forms, into the corpus of adaptive systems used in telecommunications and industrial process control.

In 1982, John Joseph Hopfield, a well-known physicist, gave the field new momentum by publishing an article introducing a new neural network model (fully recurrent).[428] The article was successful for several reasons, chief among them that it coloured neural network theory with the rigour characteristic of physicists. Neural networks again became an acceptable subject of study, although the Hopfield model suffered from the main limitations of the 1960s models, notably the inability to handle nonlinear problems.

Multilayer perceptron

At the same time, algorithmic approaches to artificial intelligence were the object of disillusion, their applications falling short of expectations. This disillusion motivated a reorientation of artificial-intelligence research towards neural networks (although these networks concern artificial perception more than artificial intelligence proper). Research was relaunched and industry regained some interest in neural networks (in particular for applications such as cruise missile guidance). Around 1984 (?), gradient backpropagation became the most debated subject in the field.

A revolution then took place in the field of artificial neural networks: a new generation of neural networks appeared, capable of successfully handling nonlinear phenomena: the multilayer perceptron does not have the defects highlighted by Marvin Minsky. First proposed by Paul Werbos, the multilayer perceptron emerged in 1986, introduced by David Rumelhart and, simultaneously, under a similar name, by Yann LeCun. These systems rely on backpropagating the error gradient through systems with several layers, each of the Adaline type of Bernard Widrow, close to Rumelhart's perceptron.

Neural networks subsequently grew considerably, and were among the first systems to benefit from the theory of "statistical regularization" introduced by Vladimir Vapnik in the Soviet Union and popularized in the West since the fall of the wall. This theory, one of the most important in statistics, makes it possible to anticipate, study and regulate phenomena linked to overfitting. A learning system can thus be regularized so that it arbitrates as well as possible between a poor model (for example, the mean) and a model that is too rich, which would be illusorily optimized on too small a number of examples and would fail on examples not yet learned, even ones close to those already learned. Overfitting is a difficulty faced by all example-based learning systems, whether they use direct optimization methods (e.g. linear regression), iterative ones (e.g. gradient descent) or semi-direct iterative ones (conjugate gradient, expectation-maximization...), and whether they are applied to classical statistical models, hidden Markov models or formal neural networks.[429]

Uses

Neural networks, as systems capable of learning, implement the principle of induction, that is, learning from experience. Through confrontation with individual situations, they infer an integrated decision system whose generality depends on the number of learning cases encountered and on their complexity relative to the complexity of the problem to be solved. By contrast, symbolic systems capable of learning, if they too implement induction, do so on the basis of algorithmic logic, by making a set of deductive rules more complex (Prolog, for example).

Thanks to their capacity for classification and generalization, neural networks are generally used for problems of a statistical nature, such as the automatic classification of postal codes or decision-making about a stock purchase based on price movements. Another example: a bank can build a dataset of the customers who have taken out loans, consisting of their income, their age, the number of dependent children... and whether they turned out to be good customers. If this dataset is large enough, it can be used to train a neural network. The bank can then present the characteristics of a potential new customer, and the network will answer whether that customer will be a good one, generalizing from the cases it knows.

If the neural network works with real numbers, the answer expresses a probability-like degree of certainty. For example: 1 for "certain they will be a good customer", -1 for "certain they will be a bad customer", 0 for "no idea", 0.9 for "almost certain they will be a good customer".
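A toy sketch of the bank scenario above, with entirely synthetic data and an off-the-shelf scikit-learn pipeline (all names and numbers here are illustrative assumptions):

<syntaxhighlight lang="python">
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic clients: [income, age, dependent children], label 1 = good customer.
X = rng.normal(loc=[30_000, 40, 1.5], scale=[10_000, 10, 1.2], size=(500, 3))
y = (X[:, 0] / 20_000 + rng.normal(size=500) > 1.5).astype(int)

clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                                  random_state=0))
clf.fit(X, y)

# A prospective client: the network generalizes from the cases it has seen
# and returns a graded degree of certainty rather than a hard rule.
print(clf.predict_proba([[45_000, 35, 2]]))
</syntaxhighlight>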

A neural network does not always yield a rule that a human can use. The network often remains a black box that provides an answer when presented with data, but without an easily interpretable justification.

Neural networks are actually used, for example:

  • for the classification of animal species given a DNA analysis;
  • pattern recognition, for example optical character recognition (OCR), used notably by banks to verify cheque amounts, by La Poste to sort mail by postal code, and for the automated navigation of autonomous mobile robots;
  • approximation of an unknown function;
  • accelerated modelling of a function that is known but very costly to compute exactly, for example some of the inversion functions used to decode remote-sensing signals emitted by satellites and turn them into data about the sea surface;
  • stock market estimates:
    • learning the value of a company from the available indicators: profits, long- and short-term debt, turnover, order book, technical indicators of the economic climate. This type of application generally poses no problem;
    • attempts to predict the periodicity of stock prices. This type of prediction is highly contested, for two reasons: first, it is not obvious that a stock's price is convincingly periodic (the market largely anticipates foreseeable rises and falls, subjecting any possible periodicity to period variations that tend to make it unreliable); second, a company's foreseeable future determines the price of its stock at least as strongly as its past does, if not more so; the cases of Pan Am, Manufrance and IBM illustrate this;
  • modelling of learning and improvement of teaching techniques;
  • in meteorology, for the classification of atmospheric conditions and statistical weather forecasting;
  • in the monitoring of hydraulic structures, for the physical understanding of displacement, uplift-pressure and leakage-flow phenomena.

Limitations

Artificial neural networks need real cases to serve as examples for their learning (this is called the learning base). These cases must be all the more numerous as the problem is complex and its topology poorly structured. For example, a neural character-reading system can be optimized using manually segmented samples of a large number of words handwritten by many people. Each character can then be presented as a raw image, with a two-dimensional spatial topology, or as a sequence of almost-all-connected segments. The chosen topology, the complexity of the modelled phenomenon and the number of examples must be in proportion. In practice this is not always easy, because the examples may be either absolutely limited in quantity or too expensive to collect in sufficient numbers.

Some problems are handled well by neural networks, in particular classification into convex domains (that is, domains such that if points A and B belong to the domain, the whole segment AB belongs to it as well). By contrast, problems such as "Is the number of inputs at 1 (or at zero) even or odd?" are solved very poorly: to assert such things over 2^N points, if one settles for a naive but homogeneous approach, exactly N-1 layers of intermediate neurons are needed, which harms the generality of the approach.
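For the smallest instance of parity (two inputs, i.e. XOR), a single hidden layer of threshold units already suffices; a hand-wired sketch:

<syntaxhighlight lang="python">
import numpy as np

def step(s):
    """Heaviside threshold unit: fires (1) iff its input is >= 0."""
    return (np.asarray(s) >= 0).astype(int)

def xor_net(x1, x2):
    """Hand-wired two-layer threshold network computing 2-input parity.
    Hidden unit 1 fires for x1 OR x2, hidden unit 2 for x1 AND x2;
    the output fires for OR AND NOT(AND), i.e. exactly one active input."""
    s = x1 + x2
    h = step(np.array([s - 1,    # OR:  x1 + x2 >= 1
                       s - 2]))  # AND: x1 + x2 >= 2
    return int(h[0] - h[1] - 0.5 >= 0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))  # prints the XOR truth table
</syntaxhighlight>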

A caricatural but telling example is the following: given only a person's weight as input, the network must determine whether that person is a woman or a man. Since women are statistically somewhat lighter than men, the network will always do slightly better than a random draw. This stripped-down example shows both the simplicity and the limitations of these models, but also how to extend them: the information "wears a skirt", if added, would clearly have a larger synaptic coefficient than the weight information alone.

Opacity

Complex artificial neural networks generally cannot explain their own way of "thinking". The computations leading to a result are not visible to the programmers who created the network.[430] A field of study has therefore been created to investigate the black box that neural networks constitute, a science that could increase confidence in the results produced by these networks or by the artificial intelligences that use them.[430]

Model

Network structure

Fichier:ArtificialNeuronModel francais.png
Structure of an artificial neuron. The neuron computes the sum of its inputs, and this value is then passed through the activation function to produce its output.

A neural network is generally composed of a succession of layers, each of which takes its inputs from the outputs of the previous one. Each layer i is composed of <math>N_i</math> neurons, taking their inputs from the <math>N_{i-1}</math> neurons of the preceding layer. Each synapse is assigned a synaptic weight, so that the <math>N_{i-1}</math> outputs are multiplied by this weight and then summed by the level-i neurons, which is equivalent to multiplying the input vector by a transformation matrix. Putting the successive layers of a neural network one behind the other would amount to cascading several transformation matrices, and could be reduced to a single matrix (the product of the others), were it not for the output function at each layer, which introduces a nonlinearity at every step. This shows the importance of choosing a good output function: a neural network whose outputs were linear would be of no interest.
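A short NumPy check of the collapse argument above (a ReLU stands in for an arbitrary nonlinear output function; the sizes are arbitrary):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 5)), rng.normal(size=(5, 3))
x = rng.normal(size=4)

# Without a nonlinearity, two stacked layers are one matrix in disguise:
print(np.allclose((x @ W1) @ W2, x @ (W1 @ W2)))  # True

# A nonlinear output function at each layer breaks the collapse:
relu = lambda s: np.maximum(s, 0.0)
print(np.allclose(relu(x @ W1) @ W2, x @ (W1 @ W2)))  # False in general
</syntaxhighlight>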

Beyond this simple structure, a neural network may also contain loops, which radically change its possibilities but also its complexity. In the same way that loops can turn combinational logic into sequential logic, loops in a neural network turn a simple input-recognition device into a complex machine capable of all kinds of behaviour.

Combination function

Consider an arbitrary neuron.

It receives a number of values from upstream neurons via its synaptic connections, and it produces a value using a combination function. This function can therefore be formalized as a vector-to-scalar function, notably (a sketch follows the list):

  • MLP (multi-layer perceptron) networks compute a linear combination of the inputs, that is, the combination function returns the dot product of the input vector and the synaptic weight vector.
  • RBF (radial basis function) networks compute a distance between the inputs and the weights, that is, the combination function returns the Euclidean norm of the difference between the input vector and the weight (centre) vector.
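A minimal sketch of the two combination functions (names and values are illustrative):

<syntaxhighlight lang="python">
import numpy as np

def mlp_combination(x, w):
    """MLP-style combination: dot product of inputs and synaptic weights."""
    return np.dot(w, x)

def rbf_combination(x, c):
    """RBF-style combination: Euclidean distance between the input vector
    and the unit's centre (its weight vector)."""
    return np.linalg.norm(x - c)

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.3])
print(mlp_combination(x, w), rbf_combination(x, w))
</syntaxhighlight>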

Activation function

The activation function (also called the threshold function or transfer function) serves to introduce a nonlinearity into the operation of the neuron.

Threshold functions generally exhibit three intervals:

  1. below the threshold, the neuron is inactive (its output is often 0 or -1);
  2. around the threshold, a transition phase;
  3. above the threshold, the neuron is active (its output is often 1).

Classic examples of activation functions are (a sketch follows the list):

  1. The sigmoid function.
  2. The hyperbolic tangent function.
  3. The Heaviside function.
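Minimal NumPy versions of the three, for illustration:

<syntaxhighlight lang="python">
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))      # smooth, output in (0, 1)
tanh = np.tanh                                    # smooth, output in (-1, 1)
heaviside = lambda s: np.where(s >= 0, 1.0, 0.0)  # hard threshold, 0 or 1

s = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh, heaviside):
    print(f(s).round(2))
</syntaxhighlight>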

Bayesian logic, whose Cox-Jaynes theorem formalizes questions of learning, also involves a recurring S-shaped function: <math>ev(p) = 10 \log \left(\frac{p}{1-p}\right)</math>

Information propagation

Once this computation is done, the neuron propagates its new internal state along its axon. In a simple model, the neuronal function is just a threshold function: it is 1 if the weighted sum exceeds a certain threshold, 0 otherwise. In a richer model, the neuron works with real numbers (often in the interval [0,1] or [-1,1]). The neural network is said to pass from one state to another when all of its neurons recompute their internal state in parallel, according to their inputs.

Learning

Theoretical basis

The notion of learning, although known since Sumer, cannot be modelled within deductive logic: the latter proceeds from already established knowledge, from which derived knowledge is drawn. The task here is the reverse: from limited observations, draw plausible generalizations. It is an inductive process.

The notion of learning covers two realities that are often treated in succession:

  • memorization: assimilating, in a dense form, possibly numerous examples;
  • generalization: being able, thanks to the learned examples, to handle distinct examples that have not yet been encountered but are similar.

In statistical learning systems, used to optimize classical statistical models, neural networks and Markov automata, generalization is the object of all the attention.

This notion of generalization is treated, more or less completely, by several theoretical approaches.

  • Generalization is treated globally and generically by the theory of statistical regularization introduced by Vladimir Vapnik. This theory, originally developed in the Soviet Union, spread in the West after the fall of the Berlin Wall. It spread very widely among those studying neural networks because of the generic shape of the residual learning and generalization error curves produced by iterative learning procedures, such as the gradient descents used to optimize multilayer perceptrons. These generic shapes match those predicted by the theory of statistical regularization; this is because gradient-descent learning procedures, starting from an initial configuration of synaptic weights, progressively explore the space of possible synaptic weights, and one then recovers the problem of the progressive growth of learning capacity, a fundamental concept at the heart of statistical regularization theory.
  • Generalization is also at the heart of Bayesian inference, which has been taught for longer. The Cox-Jaynes theorem provides an important basis for such learning, teaching us that any learning method is either isomorphic to probability theory equipped with Bayes' rule, or incoherent. This is an extremely strong result, and it is why Bayesian methods are widely used in the field.

Classes of solvable problems

Depending on the network structure, different types of function can be approximated with neural networks:

Functions representable by a perceptron

A perceptron (a single-unit network) can represent the following Boolean functions: AND, OR, NAND and NOR, but not XOR. Since every Boolean function can be represented using these functions, a network of perceptrons can represent all Boolean functions. Indeed, NAND and NOR are said to be universal: by combining either of these functions, all the others can be represented.
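A minimal sketch of a single threshold unit realizing AND, OR and NAND (weights chosen by hand; no single unit can realize XOR, since its positive cases (0,1) and (1,0) cannot be separated from (0,0) and (1,1) by one line):

<syntaxhighlight lang="python">
import numpy as np

def threshold_unit(w, b):
    """Single formal neuron: fires iff w . x + b >= 0."""
    return lambda x: int(np.dot(w, x) + b >= 0)

AND = threshold_unit(np.array([1, 1]), -1.5)
OR = threshold_unit(np.array([1, 1]), -0.5)
NAND = threshold_unit(np.array([-1, -1]), 1.5)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, AND(x), OR(x), NAND(x))
</syntaxhighlight>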

Functions representable by acyclic multilayer neural networks

  • Boolean functions: all Boolean functions are representable by a two-layer network. In the worst case, the number of neurons in the hidden layer grows exponentially with the number of inputs.
  • Continuous functions: all bounded continuous functions are representable, to arbitrary precision, by a two-layer network (Cybenko, 1989). This theorem applies to networks whose hidden-layer neurons use the sigmoid and whose output-layer neurons are linear (without threshold). The number of neurons in the hidden layer depends on the function to be approximated.
  • Arbitrary functions: any function can be approximated to arbitrary precision by a three-layer network (Cybenko's theorem, 1988).

Algorithm

The large majority of neural networks have a "training" algorithm that modifies the synaptic weights according to a dataset presented at the network's input. The purpose of this training is to let the neural network "learn" from the examples. If training is carried out properly, the network can produce outputs very close to the original values of the training dataset. But the whole point of neural networks lies in their capacity to generalize to examples outside the training set. A neural network can therefore also be used to implement a memory; this is then called a neural memory.

From a topological point of view, learning corresponds to determining a hypersurface on <math>\mathbb{R}^n</math>, where <math>\mathbb{R}</math> is the set of reals and <math>n</math> is the number of inputs of the network.

Training

Supervised or unsupervised mode

Learning is said to be supervised when the network is forced to converge towards a precise final state at the same time as a pattern is presented to it.

Conversely, in unsupervised learning, the network is left free to converge to any final state when a pattern is presented to it.

Overfitting

The examples in the learning base often contain approximate or noisy values. If the network is forced to respond almost perfectly on these examples, the result can be a network biased by erroneous values.

For example, imagine the network is presented with pairs <math>(x_i, f(x_i))</math> lying on a line of equation <math>y=ax+b</math>, but noisy, so that the points are not exactly on the line. If learning goes well, the network answers <math>ax+b</math> for any value of <math>x</math> presented. If there is overfitting, the network answers a little more or a little less than <math>ax+b</math>, because each pair <math>(x_i, f(x_i))</math> positioned off the line influences the decision: it has learned the noise as well, which is not desirable.

To avoid overfitting, there is a simple method: split the example base into two subsets. The first is used for learning and the second for evaluating the learning. As long as the error obtained on the second set keeps decreasing, learning can continue; otherwise it is stopped. A minimal sketch of this early-stopping scheme follows.
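A minimal sketch, assuming a toy linear model y = a*x + b fitted by gradient descent on synthetic noisy data (all numbers are illustrative):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Noisy points near y = 2x + 1, split into a learning set and an evaluation set.
X = rng.uniform(-1, 1, size=200)
y = 2 * X + 1 + rng.normal(scale=0.3, size=200)
X_fit, y_fit, X_eval, y_eval = X[:100], y[:100], X[100:], y[100:]

a = b = 0.0                      # model y = a*x + b, trained by gradient descent
lr, best_err, patience = 0.1, np.inf, 0
for epoch in range(10_000):
    err = y_fit - (a * X_fit + b)
    a += lr * np.mean(err * X_fit)   # gradient step on the learning set
    b += lr * np.mean(err)
    eval_err = np.mean((y_eval - (a * X_eval + b)) ** 2)
    if eval_err < best_err:
        best_err, patience = eval_err, 0
    else:                            # evaluation error has stopped decreasing
        patience += 1
        if patience >= 10:
            break
print(epoch, round(a, 3), round(b, 3), round(best_err, 4))
</syntaxhighlight>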

Backpropagation

Backpropagation consists in propagating the error made by a neuron back to its synapses and to the neurons connected to them. For neural networks, one usually uses backpropagation of the error gradient, which corrects errors according to the importance of the elements that contributed to producing them: synaptic weights that contributed to producing a large error are modified more significantly than weights that produced only a marginal error. A minimal worked example follows.
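A minimal worked example, assuming a tiny 2-3-1 sigmoid network trained on a single example with squared error (all sizes and the learning rate are arbitrary):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

# Tiny network: 2 inputs -> 3 hidden -> 1 output, squared error on one example.
W1, W2 = rng.normal(size=(2, 3)), rng.normal(size=(3, 1))
x, target = np.array([0.5, -1.0]), np.array([1.0])

h = sigmoid(x @ W1)                              # forward pass
yhat = sigmoid(h @ W2)

# Backward pass: the output error is pushed back through the network,
# and each weight is corrected in proportion to its contribution.
delta_out = (yhat - target) * yhat * (1 - yhat)  # error at the output unit
delta_hid = (delta_out @ W2.T) * h * (1 - h)     # error attributed to hidden units

W2 -= 0.5 * np.outer(h, delta_out)               # larger contribution, larger correction
W1 -= 0.5 * np.outer(x, delta_hid)
</syntaxhighlight>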

Pruning

Pruning is a method for avoiding overfitting while limiting the complexity of the model. It consists in removing connections (synapses), inputs or neurons from the network once learning is finished. In practice, the elements with the least influence on the network's output error are removed. The two most widely used pruning algorithms are (a sketch follows the list):

  • Optimal brain damage (OBD), by Yann LeCun et al.
  • Optimal brain surgeon (OBS), by B. Hassibi and D. G. Stork
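OBD and OBS rank weights by their estimated effect on the error using second-derivative information; as a cruder stand-in for illustration, the sketch below prunes by weight magnitude instead:

<syntaxhighlight lang="python">
import numpy as np

def magnitude_prune(W, fraction=0.5):
    """Zero out the given fraction of weights with the smallest magnitude.

    This is NOT OBD/OBS, which estimate each weight's effect on the error
    via second derivatives; magnitude pruning is a simpler illustration
    of removing the least influential connections.
    """
    threshold = np.quantile(np.abs(W), fraction)
    return np.where(np.abs(W) < threshold, 0.0, W)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
print(magnitude_prune(W, fraction=0.5))
</syntaxhighlight>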

Different types of neural networks

The set of weights of the synaptic connections determines how the neural network operates. Patterns are presented to a subset of the neural network: the input layer. When a pattern is applied to a network, the network seeks to reach a stable state. Once it is reached, the activation values of the output neurons constitute the result. The neurons that belong neither to the input layer nor to the output layer are called hidden neurons.

Neural network types differ in several parameters:

  • the topology of the connections between neurons;
  • the aggregation function used (weighted sum, pseudo-Euclidean distance...);
  • the threshold function used (sigmoid, step, linear function, Gaussian function...);
  • the learning algorithm (gradient backpropagation, cascade correlation);
  • other parameters, specific to certain types of neural networks, such as the relaxation method for networks (e.g. Hopfield networks) that are not simple-propagation networks (unlike, say, the multilayer perceptron).

Many other parameters can come into play in training these neural networks, for example:

  • the weight decay method, which avoids boundary effects and counteracts overfitting.

Networks with supervised learning

Without backpropagation

Perceptron

Modèle:Article détaillé

ADALINE (adaptive linear neuron)

The ADALINE network is close to the perceptron model; only its activation function differs, since it uses a linear function. To reduce the interference received at the input, ADALINE networks use the least-squares method.

The network computes a weighted sum of its input values and adds a predefined threshold value to it. The linear transfer function is then used to activate the neuron. During learning, the synaptic coefficients of the various inputs are modified using the Widrow-Hoff rule. These networks are often used in signal processing,[431] notably for noise reduction.

Cauchy machine

A Cauchy machine is an artificial neural network fairly close in operation to a Boltzmann machine. However, the probability laws used are not the same.[432]

Not detailed
  1. Adaptive heuristic critic (AHC)
  2. Time delay neural network (TDNN)
  3. Associative reward penalty (ARP)
  4. Avalanche matched filter (AMF)
  5. Backpercolation (Perc)
  6. Artmap
  7. Adaptive logic network (ALN)
  8. Cascade correlation (CasCor)
  9. Extended Kalman filter (EKF)
  10. Learning vector quantization (LVQ)
  11. Probabilistic neural network (PNN)
  12. General regression neural network (GRNN)

With backpropagation

Multilayer perceptron

Modèle:Article détaillé

Not detailed
  1. Brain-State-in-a-Box (BSB)
  2. Fuzzy cognitive map (FCM)
  3. Mean field annealing (MFT)
  4. Recurrent cascade correlation (RCC)
  5. Backpropagation through time (BPTT)
  6. Real-time recurrent learning (RTRL)
  7. Recurrent extended Kalman filter (EKF)

Networks with unsupervised learning

With backpropagation

  1. Self-organizing map
  2. Restricted Boltzmann machine (RBM)
  3. Sparse coding
Not detailed
  1. Additive Grossberg (AG)
  2. Shunting Grossberg (SG)
  3. Binary adaptive resonance theory (ART1)
  4. Analog adaptive resonance theory (ART2, ART2a)
  5. Discrete Hopfield (DH)
  6. Continuous Hopfield (CH)
  7. Fractal chaos[433],[434],[435]
  8. Discrete bidirectional associative memory (BAM)
  9. Temporal associative memory (TAM)
  10. Adaptive bidirectional associative memory (ABAM)
  11. Competitive learning

In this type of unsupervised learning, the neurons compete to be active. They have binary outputs and are said to be active when their output is 1. Whereas in the other rules several neuron outputs can be active simultaneously, in competitive learning only one neuron is active at a given instant. Each output neuron specializes in "detecting" a family of similar patterns and thereby becomes a feature detector. In this case the input function is <math>h = b - \operatorname{dist}(W, X)</math>, where <math>b</math>, <math>W</math> and <math>X</math> are respectively the threshold, synaptic weight and input vectors. The winning neuron is the one for which h is maximal, hence, if the thresholds are identical, the one whose weights are closest to the inputs. The neuron with the maximal output is the winner; its output is set to 1 while the losers' outputs are set to 0. A neuron learns by moving its weights towards the values of the inputs that activate it, in order to increase its chances of winning. If a neuron does not respond to an input, no weight adjustment takes place. If a neuron wins, a portion of the weights of all the inputs is redistributed to the weights of the active inputs. Applying the rule gives the following results (Grossberg):

  • <math>\Delta w_{ij} = \mathrm{lr}\,(x_j - w_{ij})</math> if neuron i wins,
  • <math>\Delta w_{ij} = 0</math> if neuron i loses.

This rule has the effect of moving the synaptic weight vector <math>w_{ij}</math> closer to the input pattern <math>x_j</math>.

Example: consider two clouds of points in the plane that we wish to separate into two classes. <math>x_1</math> and <math>x_2</math> are the two inputs; <math>w_{11}</math> and <math>w_{12}</math> are the weights of neuron 1, which can be viewed as the coordinates of a point "weights of neuron 1", and <math>w_{21}</math> and <math>w_{22}</math> are the weights of neuron 2. If the thresholds are zero, <math>h_i</math> reduces to (minus) the distance between the point to be classified and the weight points, so the nearest one wins. The preceding rule tends to decrease this distance to the sample point when the neuron wins; it should therefore allow each weight point to position itself in the middle of a cloud. If the weights are initially set at random, it may happen that one neuron positions itself near both clouds while the other ends up far away, so that it never wins; its weights will never be able to evolve, while those of the other neuron position it in the middle of the two clouds. The problem of these so-called dead neurons can be solved by playing on the thresholds: raising the thresholds of such neurons is enough to make them start winning. A sketch follows.
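A minimal sketch of this two-cloud example (the cloud centres and learning rate are arbitrary; thresholds are taken as zero, so the nearest weight vector wins):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Two clouds of points in the plane, as in the example above.
cloud_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2))
cloud_b = rng.normal(loc=(3.0, 3.0), scale=0.3, size=(100, 2))
data = rng.permutation(np.vstack([cloud_a, cloud_b]))

W = rng.normal(size=(2, 2))  # one weight vector ("weight point") per neuron
lr = 0.1
for x in data:
    winner = np.argmin(np.linalg.norm(W - x, axis=1))  # closest weights win
    W[winner] += lr * (x - W[winner])                  # only the winner moves
print(W)  # each row should end up near the centre of one cloud
</syntaxhighlight>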

Applications: this type of network and the corresponding learning method can be used in data analysis to bring out similarities between certain data.

Remarks

Being models, neural networks are generally used in software simulation. IMSL and Matlab thus have libraries dedicated to neural networks. However, a few hardware implementations of the simplest models exist, such as the ZISC chip.

See also

Modèle:Autres projets Modèle:Colonnes

References

  • Modèle:En Warren Sturgis McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115-133, 1943.
  • Modèle:En Frank Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-408, 1958.
  • Modèle:En Marvin Lee Minsky and Seymour Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, expanded edition, 1988.
  • Modèle:En John Joseph Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79:2554-2558, 1982.
  • Yann LeCun. Une procédure d'apprentissage pour réseau à seuil asymétrique. COGNITIVA 85, Paris, June 4-7, 1985.
  • Modèle:En D. E. Rumelhart and J. L. McClelland. Parallel Distributed Processing: Exploration in the MicroStructure of Cognition. MIT Press, Cambridge, 1986.
  • Modèle:En J. A. Anderson and E. Rosenfeld. Neuro Computing Foundations of Research. MIT Press, Cambridge, 1988.
  • Modèle:Mitchell
  • Modèle:Lien web.

Notes and references

  1. Modèle:Cite journal
  2. Modèle:Cite journal
  3. Modèle:Cite journal
  4. Modèle:Cite journal
  5. Modèle:Cite journal
  6. Modèle:Cite web
  7. Modèle:Cite journal
  8. Modèle:Cite journal
  9. Modèle:Cite arxiv
  10. Modèle:Cite journal
  11. Modèle:Cite journal
  12. Modèle:Cite journal
  13. Jürgen Schmidhuber (2015). "Deep Learning". Scholarpedia, 10(11):32832. Online
  14. Modèle:Cite journal
  15. Balázs Csanád Csáji (2001). Approximation with Artificial Neural Networks. Faculty of Sciences, Eötvös Loránd University, Hungary.
  16. Modèle:Cite journal
  17. Modèle:Cite journal
  18. Modèle:Cite book
  19. Modèle:Cite book
  20. Lu, Z., Pu, H., Wang, F., Hu, Z., & Wang, L. (2017). "The Expressive Power of Neural Networks: A View from the Width". Neural Information Processing Systems, 6231-6239.
  21. Modèle:Cite book
  22. Modèle:Cite journal
  23. Modèle:Cite arXiv
  24. Modèle:Cite book
  25. Rina Dechter (1986). Learning while searching in constraint-satisfaction problems. University of California, Computer Science Department, Cognitive Systems Laboratory.Online
  26. Igor Aizenberg, Naum N. Aizenberg, Joos P.L. Vandewalle (2000). Multi-Valued and Universal Binary Neurons: Theory, Learning and Applications. Springer Science & Business Media.
  27. Co-evolving recurrent neurons learn deep memory POMDPs. Proc. GECCO, Washington, D. C., pp. 1795-1802, ACM Press, New York, NY, USA, 2005.
  28. Modèle:Cite book
  29. Modèle:Cite journal
  30. Modèle:Cite journal
  31. Seppo Linnainmaa (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's thesis (in Finnish), Univ. Helsinki, 6-7.
  32. Modèle:Cite journal
  33. Modèle:Cite journal
  34. Modèle:Cite book
  35. LeCun et al., "Backpropagation Applied to Handwritten Zip Code Recognition", Neural Computation, 1, pp. 541–551, 1989.
  36. J. Weng, N. Ahuja and T. S. Huang, "Cresceptron: a self-organizing neural network which grows adaptively", Proc. International Joint Conference on Neural Networks, Baltimore, Maryland, vol. I, pp. 576-581, June 1992.
  37. J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation of 3-D objects from 2-D images", Proc. 4th International Conf. Computer Vision, Berlin, Germany, pp. 121-128, May 1993.
  38. J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation using the Cresceptron", International Journal of Computer Vision, vol. 25, no. 2, pp. 105-139, Nov. 1997.
  39. Modèle:Cite journal
  40. Modèle:Cite journal
  41. S. Hochreiter, "Untersuchungen zu dynamischen neuronalen Netzen", Diploma thesis, Institut f. Informatik, Technische Univ. Munich, 1991. Advisor: J. Schmidhuber.
  42. Modèle:Cite book
  43. Modèle:Cite journal
  44. Modèle:Cite journal
  45. Modèle:Cite journal
  46. Modèle:Cite journal
  47. Modèle:Cite web
  48. Modèle:Cite journal
  49. Modèle:Cite journal
  50. Modèle:Cite web
  51. Modèle:Cite journal
  52. Modèle:Cite web
  53. Modèle:Cite journal
  54. Santiago Fernandez, Alex Graves, and Jürgen Schmidhuber (2007). "An application of recurrent neural networks to discriminative keyword spotting". Proceedings of ICANN (2), pp. 220–229.
  55. Modèle:Cite web
  56. Modèle:Cite journal
  57. Modèle:Cite journal
  58. Modèle:Cite arXiv
  59. G. E. Hinton., "Learning multiple layers of representation," Trends in Cognitive Sciences, 11, pp. 428–434, 2007.
  60. Modèle:Cite journal
  61. Modèle:Cite journal
  62. Modèle:Cite journal
  63. Modèle:Cite web
  64. Modèle:Cite arxiv
  65. Modèle:Cite web
  66. Modèle:Cite web
  67. Modèle:Cite journal
  68. Yann LeCun (2016). Slides on Deep Learning. Online
  69. NIPS Workshop: Deep Learning for Speech Recognition and Related Applications, Whistler, BC, Canada, Dec. 2009 (Organizers: Li Deng, Geoff Hinton, D. Yu).
  70. Keynote talk: "Recent Developments in Deep Neural Networks". ICASSP, 2013 (by Geoff Hinton).
  71. D. Yu, L. Deng, G. Li, and F. Seide (2011). "Discriminative pretraining of deep neural networks". U.S. patent filing.
  72. Modèle:Cite journal
  73. Modèle:Cite book
  74. Modèle:Cite web
  75. Modèle:Cite web
  76. Modèle:Cite journal
  77. Modèle:Cite journal
  78. Modèle:Cite journal
  79. Modèle:Cite web
  80. Modèle:Cite web
  81. Modèle:Cite journal
  82. Chellapilla, K., Puri, S., and Simard, P. (2006). "High performance convolutional neural networks for document processing". International Workshop on Frontiers in Handwriting Recognition.
  83. Modèle:Cite journal
  84. Modèle:Cite journal
  85. Modèle:Cite arXiv
  86. Modèle:Cite web
  87. Modèle:Cite web
  88. "Toxicology in the 21st century Data Challenge"
  89. Modèle:Cite web
  90. Modèle:Cite web
  91. Modèle:Cite journal
  92. Modèle:Cite book
  93. Modèle:Cite journal
  94. Modèle:Cite web
  95. Modèle:Cite arxiv.
  96. Modèle:Cite arxiv.
  97. Modèle:Cite arxiv.
  98. Modèle:Cite journal
  99. Modèle:Cite news
  100. Modèle:Cite journal
  101. Modèle:Cite web
  102. Modèle:Cite journal
  103. Modèle:Cite news
  104. Modèle:Cite journal
  105. Modèle:Cite journal
  106. Modèle:Cite arxiv
  107. Modèle:Cite arxiv
  108. Modèle:Cite journal
  109. Modèle:Cite web
  110. Modèle:Cite journal
  111. Modèle:Cite journal
  112. Modèle:Cite journal
  113. Modèle:Cite web
  114. Modèle:Cite journal
  115. Modèle:Cite web
  116. Modèle:Cite journal
  117. Ting Qin, et al. "A learning algorithm of CMAC based on RLS". Neural Processing Letters 19.1 (2004): 49-61.
  118. Ting Qin, et al. "Continuous CMAC-QRLS and its systolic array". Neural Processing Letters 22.1 (2005): 1-16.
  119. TIMIT Acoustic-Phonetic Continuous Speech Corpus Linguistic Data Consortium, Philadelphia.
  120. Modèle:Cite journal
  121. Modèle:Cite journal
  122. Modèle:Cite journal
  123. Modèle:Cite journal
  124. Modèle:Cite web
  125. Modèle:Cite arxiv
  126. Modèle:Cite web
  127. Modèle:Cite web
  128. Modèle:Cite journal
  129. Nvidia Demos a Car Computer Trained with "Deep Learning" (2015-01-06), David Talbot, MIT Technology Review
  130. Modèle:Cite web
  131. Modèle:Cite web
  132. Modèle:Cite journal
  133. Modèle:Cite arXiv
  134. Modèle:Cite web
  135. Modèle:Cite journal
  136. Modèle:Cite paper
  137. Modèle:Cite journal
  138. Modèle:Cite journal
  139. Modèle:Cite journal
  140. Modèle:Cite journal
  141. Modèle:Cite journal
  142. Modèle:Cite news
  143. Modèle:Cite web
  144. Modèle:Cite web
  145. Modèle:Cite journal
  146. Modèle:Cite journal
  147. Modèle:Cite arXiv
  148. "An Infusion of AI Makes Google Translate More Powerful Than Ever." Cade Metz, WIRED, Date of Publication: 09.27.16. https://www.wired.com/2016/09/google-claims-ai-breakthrough-machine-translation/
  149. Modèle:Cite web
  150. Modèle:Cite journal
  151. Modèle:Cite journal
  152. Modèle:Cite arXiv
  153. Modèle:Cite web
  154. Modèle:Cite web
  155. Modèle:Cite web
  156. Modèle:Cite arxiv
  157. Modèle:Cite book
  158. Modèle:Cite journal
  159. Modèle:Cite book
  160. Modèle:Cite journal
  161. Modèle:Cite journal
  162. Modèle:Cite web
  163. Modèle:Cite journal
  164. Modèle:Cite conference
  165. Modèle:Cite journal
  166. Modèle:Cite web
  167. Modèle:Cite journal
  168. Modèle:Cite book
  169. Modèle:Cite journal
  170. Modèle:Cite journal
  171. S. Blakeslee., "In brain's early growth, timetable may be critical," The New York Times, Science Section, pp. B5–B6, 1995.
  172. Modèle:Cite journal
  173. Modèle:Cite journal
  174. Modèle:Cite journal
  175. Modèle:Cite journal
  176. Modèle:Cite journal
  177. Modèle:Cite journal
  178. Modèle:Cite journal
  179. Modèle:Cite journal
  180. Modèle:Cite journal
  181. Modèle:Cite journal
  182. Modèle:Cite journal
  183. Modèle:Cite news
  184. Modèle:Cite web
  185. Modèle:Cite journalModèle:Closed access
  186. Modèle:Cite web
  187. Modèle:Cite web
  188. Modèle:Cite web
  189. Modèle:Cite web
  190. Modèle:Cite web
  191. Modèle:Cite web
  192. Modèle:Cite web
  193. Modèle:Cite web
  194. Modèle:Cite web
  195. Modèle:Cite web
  196. Modèle:Cite web
  197. Modèle:Cite web
  198. Modèle:Cite arxiv
  199. Modèle:Cite arxiv
  200. Modèle:Cite journal
  201. Miller, G. A., and N. Chomsky. "Pattern conception." Paper for Conference on pattern detection, University of Michigan. 1957.
  202. Modèle:Cite web
  203. Modèle:Cite news
  204. Modèle:Cite journal
  205. Rumelhart, D.E., J.L. McClelland and the PDP Research Group (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, Cambridge, MA: MIT Press
  206. McClelland, J.L., D.E. Rumelhart and the PDP Research Group (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 2: Psychological and Biological Models, Cambridge, MA: MIT Press
  207. McClelland and Rumelhart "Explorations in Parallel Distributed Processing Handbook", MIT Press, 1987
  208. Plunkett, K. and Elman, J.L., Exercises in Rethinking Innateness: A Handbook for Connectionist Simulations (The MIT Press, 1997)
  209. Modèle:Cite web
  210. Modèle:Cite web
  211. Modèle:Cite web
  212. Modèle:Cite web
  213. Modèle:Cite journal
  214. Modèle:Cite news
  215. Modèle:Cite book
  216. Modèle:Cite journal
  217. Modèle:Cite journal
  218. Modèle:Cite journal
  219. Modèle:Cite book
  220. Modèle:Cite book
  221. Modèle:Cite journal
  222. Modèle:Cite book
  223. Modèle:Cite book
  224. Modèle:Cite book
  225. Modèle:Cite article
  226. Modèle:Cite article
  227. J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation of 3-D objects from 2-D images", Proc. 4th International Conf. Computer Vision, Berlin, Germany, pp. 121–128, May 1993.
  228. Dominik Scherer, Andreas C. Müller, and Sven Behnke, "Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition", in 20th International Conference on Artificial Neural Networks (ICANN), pp. 92–101, 2010. Modèle:Doi.
  229. S. Hochreiter, "Untersuchungen zu dynamischen neuronalen Netzen", Diploma thesis, Institut f. Informatik, Technische Univ. Munich, 1991. Advisor: J. Schmidhuber.
  230. J. Schmidhuber., "Learning complex, extended sequences using the principle of history compression," Neural Computation, 4, pp. 234–242, 1992.
  231. Modèle:Cite book
  232. Modèle:Cite book
  233. Modèle:Cite journal
  234. Modèle:Cite journal
  235. Modèle:Cite arXiv
  236. Modèle:Cite journal
  237. Modèle:Cite journal
  238. 2012 Kurzweil AI Interview with Jürgen Schmidhuber on the eight competitions won by his Deep Learning team 2009–2012
  239. Modèle:Cite web
  240. Graves, Alex; and Schmidhuber, Jürgen; Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks, in Bengio, Yoshua; Schuurmans, Dale; Lafferty, John; Williams, Chris K. I.; and Culotta, Aron (eds.), Advances in Neural Information Processing Systems 22 (NIPS'22), 7–10 December 2009, Vancouver, BC, Neural Information Processing Systems (NIPS) Foundation, 2009, pp. 545–552.
  241. Modèle:Cite journal
  242. Modèle:Cite journal
  243. Modèle:Cite journal
  244. Modèle:Cite journal
  245. Modèle:Cite book
  246. Modèle:Cite journal
  247. Modèle:Cite journal
  248. Modèle:Cite news
  249. Modèle:Cite journal
  250. J. Weng, "Why Have We Passed 'Neural Networks Do not Abstract Well'?," Natural Intelligence: the INNS Magazine, vol. 1, no.1, pp. 13–22, 2011.
  251. Z. Ji, J. Weng, and D. Prokhorov, "Where-What Network 1: Where and What Assist Each Other Through Top-down Connections," Proc. 7th International Conference on Development and Learning (ICDL'08), Monterey, CA, Aug. 9–12, pp. 1–6, 2008.
  252. X. Wu, G. Guo, and J. Weng, "Skull-closed Autonomous Development: WWN-7 Dealing with Scales," Proc. International Conference on Brain-Mind, July 27–28, East Lansing, Michigan, pp. 1–9, 2013.
  253. Modèle:Cite book
  254. Modèle:Cite journal
  255. Modèle:Cite journal
  256. Modèle:Cite web
  257. Modèle:Cite journal
  258. Eiji Mizutani, Stuart Dreyfus, Kenichi Nishio (2000). On derivation of MLP backpropagation from the Kelley-Bryson optimal-control gradient formula and its application. Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2000), Como Italy, July 2000. Online
  259. Modèle:Cite journal
  260. Arthur E. Bryson (1961, April). A gradient method for optimizing multi-stage allocation processes. In Proceedings of the Harvard Univ. Symposium on digital computers and their applications.
  261. Modèle:Cite journal
  262. Modèle:Cite book
  263. Modèle:Cite book
  264. Modèle:Cite journal
  265. Modèle:Cite book
  266. Modèle:Cite journal
  267. Paul Werbos (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University.
  268. Eric A. Wan (1993). "Time series prediction by using a connectionist network with internal delay lines." In Proceedings of the Santa Fe Institute Studies in the Sciences of Complexity, 15: p. 195. Addison-Wesley Publishing Co.
  269. Modèle:Cite journal
  270. Modèle:Cite journal
  271. Modèle:Cite journal
  272. Modèle:Cite arXiv
  273. ESANN. 2009
  274. Modèle:Cite journal
  275. Modèle:Cite conference
  276. Modèle:Cite journal
  277. Modèle:Cite book
  278. Modèle:Cite journal
  279. Modèle:Cite conference
  280. Modèle:Cite conference
  281. Modèle:Cite book
  282. Modèle:Cite conference
  283. Modèle:Cite conference
  284. Modèle:Cite web
  285. Modèle:Cite conference
  286. Modèle:Cite conference
  287. Werbos, Paul J. (1994). The Roots of Backpropagation. From Ordered Derivatives to Neural Networks and Political Forecasting. New York, NY: John Wiley & Sons, Inc.
  288. Modèle:Cite book
  289. Modèle:Cite journal
  290. Modèle:Cite journal
  291. Modèle:Cite web
  292. Modèle:Cite book
  293. Modèle:Cite journal
  294. Modèle:Citation
  295. Modèle:Cite journal
  296. Modèle:Cite book
  297. Modèle:Cite journal
  298. Modèle:Cite arxiv
  299. Modèle:Cite journal
  300. Modèle:Cite journal
  301. Modèle:Cite arxiv
  302. Modèle:Cite arxiv
  303. Modèle:Cite journal
  304. Modèle:Cite journal
  305. Modèle:Cite book
  306. Modèle:Cite book
  307. Modèle:Patent
  308. D. Graupe, Principles of Artificial Neural Networks, 3rd edition, World Scientific Publishers, 2013, pp. 203–274.
  309. Modèle:Cite journal
  310. Modèle:Cite journal
  311. Modèle:Harvnb
  312. Modèle:Cite journal
  313. D. Graupe, Principles of Artificial Neural Networks, 3rd edition, World Scientific Publishers, 2013, pp. 253–274.
  314. Modèle:Cite journal
  315. Modèle:Cite journal
  316. Modèle:Cite web
  317. Modèle:Cite book
  318. Modèle:Cite book
  319. Modèle:Cite journal
  320. Modèle:Cite journal
  321. Modèle:Cite journal
  322. Modèle:Harvnb
  323. Modèle:Cite journal
  324. Modèle:Cite web
  325. Modèle:Cite journal
  326. Modèle:Cite journal
  327. Modèle:Cite journal
  328. Modèle:Cite journal
  329. Modèle:Cite journal
  330. Modèle:Cite journal
  331. Modèle:Cite journal
  332. Modèle:Cite journal
  333. Modèle:Cite journal
  334. Modèle:Cite conference
  335. Modèle:Cite journal
  336. Modèle:Cite journal
  337. Modèle:Cite journal
  338. Modèle:Cite journal
  339. Modèle:Cite book
  340. Modèle:Cite journal
  341. Modèle:Cite journal
  342. Modèle:Cite journal
  343. Modèle:Cite journal
  344. Modèle:Cite book
  345. Modèle:Cite journal
  346. Modèle:Cite journal
  347. Modèle:Cite book
  348. Modèle:Cite journal
  349. Modèle:Cite journal
  350. Modèle:Cite journal
  351. Modèle:Cite arXiv
  352. Modèle:Cite web
  353. S. Das, C.L. Giles, G.Z. Sun, "Learning Context Free Grammars: Limitations of a Recurrent Neural Network with an External Stack Memory," Proc. 14th Annual Conf. of the Cog. Sci. Soc., p. 79, 1992.
  354. Modèle:Cite web
  355. Modèle:Cite journal
  356. Modèle:Cite journal
  357. Modèle:Cite conference
  358. Modèle:Cite journal
  359. Modèle:Cite arxiv
  360. Modèle:Cite arxiv
  361. Modèle:Cite news
  362. Modèle:Cite news
  363. Modèle:Cite web
  364. Modèle:Cite journal
  365. Modèle:Cite web
  366. Modèle:Cite journal
  367. Salakhutdinov, Ruslan, and Geoffrey Hinton. "Semantic hashing." International Journal of Approximate Reasoning 50.7 (2009): 969–978.
  368. Modèle:Cite arXiv
  369. Modèle:Cite arxiv
  370. Modèle:Cite arxiv
  371. Modèle:Cite arxiv
  372. Modèle:Cite news
  373. Modèle:Cite arxiv
  374. Modèle:Cite arxiv
  375. Modèle:Cite web
  376. Modèle:Cite web
  377. Modèle:Cite arxiv
  378. Modèle:Cite journal
  379. Modèle:Cite journal
  380. Modèle:Cite journal
  381. Modèle:Cite journal
  382. Modèle:Cite arxiv
  383. Modèle:Cite journal
  384. Modèle:Cite journal
  385. Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search". Nature 529.7587 (2016): 484.
  386. Modèle:Cite journal
  387. Choy, Christopher B., et al. "3d-r2n2: A unified approach for single and multi-view 3d object reconstruction." European conference on computer vision. Springer, Cham, 2016.
  388. Modèle:Cite journal
  389. Modèle:Cite news
  390. Modèle:Cite web
  391. Modèle:Cite web
  392. Modèle:Cite journal
  393. Modèle:Cite journal
  394. Modèle:Cite journal
  395. Modèle:Cite journal
  396. Modèle:Cite journal
  397. Modèle:Cite journal
  398. Modèle:Cite journal
  399. Modèle:Cite journal
  400. Modèle:Cite journal
  401. Modèle:Cite journal
  402. Modèle:Cite journal
  403. Modèle:Cite web
  404. Modèle:Cite web
  405. Modèle:Cite web
  406. Modèle:Citation
  407. Modèle:Cite journal
  408. Modèle:Cite journal
  409. Modèle:Cite journal
  410. Modèle:Cite journal
  411. Modèle:Cite journal
  412. Modèle:Cite journal
  413. Modèle:Cite book
  414. D. J. Felleman and D. C. Van Essen, "Distributed hierarchical processing in the primate cerebral cortex," Cerebral Cortex, 1, pp. 1–47, 1991.
  415. J. Weng, "Natural and Artificial Intelligence: Introduction to Computational Brain-Mind," BMI Press, Modèle:ISBN, 2012.
  416. Modèle:Cite journal
  417. "A Survey of FPGA-based Accelerators for Convolutional Neural Networks", NCAA, 2018
  418. Modèle:Cite news
  419. NASA – Dryden Flight Research Center – News Room: News Releases: NASA NEURAL NETWORK PROJECT PASSES MILESTONE. Nasa.gov. Retrieved on 2013-11-20.
  420. Modèle:Cite web
  421. Sun and Bookman (1990)
  422. Modèle:Cite journal
  423. Modèle:Harv
  424. These paradigms correspond to the different types of learning by neural networks, notably supervised and unsupervised learning and reinforcement learning.
  425. Modèle:Lien web
  426. Lettvin, J.Y., Maturana, H.R., McCulloch, W.S., & Pitts, W.H. ; What the Frog's Eye Tells the Frog's Brain, (PDF, 14 pages) (1959) ; Proceedings of the IRE, Vol. 47, No. 11, pp. 1940-51.
  427. Bishop (2006), p. 193
  428. Hopfield, J. J. Proc. natn. Acad. Sci. U.S.A. 79, 2554–2558 (1982).
  429. Modèle:Article
  430. Appenzeller, Tim (2017), "The AI revolution in science", Science News, July 7.
  431. Mohan Mokhtari and Michel Marie, Applications de MATLAB 5 et SIMULINK 2 : Contrôle de procédés, Logique floue, Réseaux de neurones, Traitement du signal, Springer-Verlag, Paris, 1998. Modèle:ISBN
  432. https://bib.irb.hr/datoteka/244548.Paper_830.pdf
  433. Teuvo Kohonen, Content-addressable Memories, Springer-Verlag, 1987, Modèle:ISBN, 388 pages
  434. Pribram, Karl (1991). Brain and Perception: Holonomy and Structure in Figural Processing. Hillsdale, N.J.: Lawrence Erlbaum Associates. Modèle:ISBN. Quotes a "fractal chaos" neural network.
  435. D. Levine et al., Oscillations in Neural Systems, published by Lawrence Erlbaum Associates, 1999, 456 pages, Modèle:ISBN

Bibliography

  • François Blayo and Michel Verleysen, Les réseaux de neurones artificiels, PUF, Que sais-je? No. 3042, 1st edition, 1996
  • Léon Personnaz and Isabelle Rivals, Réseaux de neurones formels pour la modélisation, la commande et la classification, CNRS Éditions, 2003.
  • Richard P. Lippman, "An Introduction to Computing with Neural Nets", IEEE ASSP Magazine, April 1987, pp. 4-22
  • Neural Networks: Biological Computers or Electronic Brains - Les entretiens de Lyon (edited by the École normale supérieure de Lyon), Springer-Verlag, 1990
  • Jean-Paul Haton, Modèles connexionnistes pour l'intelligence artificielle, 1989.
  • Gérard Dreyfus, Jean-Marc Martinez, Manuel Samuelides, Mirta Gordon, Fouad Badran and Sylvie Thiria, Apprentissage statistique : réseaux de neurones, cartes topologiques, machines à vecteurs supports, Eyrolles, 2008
  • Eric Davalo and Patrick Naïm, Des réseaux de neurones, Eyrolles, 1990
  • Simon Haykin, Neural Networks: A Comprehensive Foundation, 2nd edition, Prentice Hall, 1998
  • Christopher M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995
  • Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
  • Richard O. Duda, Peter E. Hart and David G. Stork, Pattern Classification, 2nd edition, Wiley, 2001
  • Ben Krose and Patrick van der Smagt, An Introduction to Neural Networks, 8th edition, 1996
  • Marc Parizeau, Réseaux de neurones (Le perceptron multicouche et son algorithme de rétropropagation des erreurs), Université Laval, Laval, 2004, 272 p.
  • Fabien Tschirhart (dir. Alain Lioret), Réseaux de neurones formels appliqués à l'Intelligence Artificielle et au jeu, ESGI (master's thesis in multimedia and digital animation), Paris, 2009, 121 p. Online





Category: Artificial neural networks

Artificial neural networks is included in the JEL classification codes as JEL: C45. The main article for this category is Artificial neural network.

This category is for articles about artificial neural networks (ANNs).

Subcategories

This category has the following 2 subcategories, out of 2 total.

D

   Deep learning‎ (40 P)

N

   Neural network software‎ (27 P)

Pages in category "Artificial neural networks"

The following 156 pages are in this category, out of 156 total. This list may not reflect recent changes.


   Artificial neural network
   Types of artificial neural networks

A

   Activation function
   ADALINE
   Adaptive neuro fuzzy inference system
   Adaptive resonance theory
   AlexNet
   ALOPEX
   AlterEgo
   Artificial Intelligence System
   Artificial neuron
   Artisto
   Autoassociative memory
   Autoencoder

B

   Backpropagation
   Backpropagation through structure
   Backpropagation through time
   Bcpnn
   Bidirectional associative memory
   Bidirectional recurrent neural networks
   BigDL
   Boltzmann machine

C

   Caffe (software)
   Capsule neural network
   Catastrophic interference
   Cellular neural network
   Cerebellar model articulation controller
   CLEVER score
   CoDi
   Committee machine
   Competitive learning
   Compositional pattern-producing network
   Computational cybernetics
   Computational neurogenetic modeling
   Confabulation (neural networks)
   Connectionist temporal classification
   Convolutional Deep Belief Networks
   Convolutional neural network
   Cover's theorem

D

   Deep belief network
   Deep lambertian networks
   Deep learning
   Deeplearning4j
   Dehaene–Changeux model
   Delta rule
   DexNet
   Differentiable neural computer
   Dropout (neural networks)

E

   Early stopping
   Echo state network
   Electricity price forecasting
   The Emotion Machine
   European Neural Network Society
   Evolutionary acquisition of neural topologies
   Extension neural network
   Extreme learning machine

F

   Feed forward (control)
   Feedforward neural network
   FindFace

G

   Gated recurrent unit
   General regression neural network
   Generalized Hebbian Algorithm
   Generative adversarial network
   Generative topographic map
   Google Neural Machine Translation
   Grossberg network
   Group method of data handling
   Growing self-organizing map

H

   Hard sigmoid
   Helmholtz machine
   Hierarchical temporal memory
   Hopfield network
   Hybrid Kohonen self-organizing map
   Hybrid neural network
   Hyper basis function network
   HyperNEAT

I

   Infomax
   Instantaneously trained neural networks
   Interactive activation and competition networks
   IPO underpricing algorithm

J

   Jpred

L

   Leabra
   Learning rule
   Learning vector quantization
   Lernmatrix
   Linde–Buzo–Gray algorithm
   Liquid state machine
   Long short-term memory

M

   Memtransistor
   Modular neural network
   MoneyBee
   Multi-surface method
   Multilayer perceptron
   Multimodal learning

N

   ND4J (software)
   ND4S
   Neocognitron
   NETtalk (artificial neural network)
   Neural cryptography
   Neural gas
   Neural network software
   Neural network synchronization protocol
   Neural Networks (journal)
   Neural Turing machine
   Neuroevolution
   Neuroevolution of augmenting topologies
   Ni1000
   NVDLA

O

   Oja's rule
   OpenNN
   Optical neural network
   Oscillatory neural network
   Outstar

P

   Perceptron
   Physical neural network
   Probabilistic neural network
   Promoter based genetic algorithm
   Pulse-coupled networks

Q

   Quantum neural network
   Quickprop

R

   Radial basis function
   Radial basis function network
   Random neural network
   Rectifier (neural networks)
   Recurrent neural network
   Recursive neural network
   Relation network
   Reservoir computing
   Residual neural network
   Restricted Boltzmann machine
   Rprop

S

   Self-organizing map
   Semantic neural network
   Sentence embedding
   Siamese network
   Sigmoid function
   Softmax function
   Spiking neural network
   SqueezeNet
   Stochastic neural analog reinforcement calculator
   Stochastic neural network
   Synaptic transistor
   Synaptic weight

T

   Tensor product network
   Time aware long short-term memory
   Time delay neural network
   Triplet loss

U

   U-matrix
   U-Net
   Universal approximation theorem

V

   Vanishing gradient problem

W

   Waifu2x
   WaveNet
   Winner-take-all (computing)
   Word embedding
   Word2vec

In artificial intelligence, an evolutionary algorithm (EA) is a subset of evolutionary computation,[1] a generic population-based metaheuristic optimization algorithm. An EA uses mechanisms inspired by biological evolution, such as reproduction, mutation, recombination, and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function determines the quality of the solutions (see also loss function). Evolution of the population then takes place after the repeated application of the above operators.

Evolutionary algorithms often perform well at approximating solutions to many types of problems because, ideally, they make no assumptions about the underlying fitness landscape. Techniques from evolutionary algorithms applied to the modeling of biological evolution are generally limited to explorations of microevolutionary processes and planning models based upon cellular processes. In most real applications of EAs, computational complexity is a prohibiting factor;[2] in fact, this computational complexity is dominated by fitness function evaluation. Fitness approximation is one way to overcome this difficulty. However, a seemingly simple EA can often solve complex problemsModèle:Citation needed; therefore, there may be no direct link between algorithm complexity and problem complexity.

Implementation

Step One: Generate the initial population of individuals randomly. (First generation)

Step Two: Evaluate the fitness of each individual in that population.

Step Three: Repeat the following regenerational steps until termination (e.g., a time limit is reached or sufficient fitness is achieved), as sketched in code after this list:

  1. Select the best-fit individuals for reproduction. (Parents)
  2. Breed new individuals through crossover and mutation operations to give birth to offspring.
  3. Evaluate the individual fitness of new individuals.
  4. Replace the least-fit individuals of the population with the new individuals.
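
Below is a minimal sketch of this generational loop in Python. The objective (a toy "one-max" bit-counting fitness), the population size, the mutation rate, and the generation count are hypothetical choices made only for illustration.

    import random

    random.seed(1)
    GENOME_LEN, POP_SIZE, GENERATIONS, MUT_RATE = 20, 30, 50, 0.05

    def fitness(genome):
        # Toy objective ("one-max"): count the 1-bits.
        return sum(genome)

    # Step One: generate the initial population randomly.
    population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
                  for _ in range(POP_SIZE)]

    for _ in range(GENERATIONS):  # Step Three: repeat until termination
        # Step Two / 1: evaluate fitness and select the best-fit parents.
        population.sort(key=fitness, reverse=True)
        parents = population[:POP_SIZE // 2]
        # 2-3: breed offspring via one-point crossover and bit-flip mutation.
        offspring = []
        while len(offspring) < POP_SIZE - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, GENOME_LEN)
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < MUT_RATE) for bit in child]
            offspring.append(child)
        # 4: replace the least-fit individuals with the new offspring.
        population = parents + offspring

    print(max(fitness(g) for g in population))  # approaches GENOME_LEN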

Types

Similar techniques differ in genetic representation and other implementation details, as well as in the nature of the particular problem to which they are applied.

  • Genetic algorithm – This is the most popular type of EA. One seeks the solution of a problem in the form of strings of numbers (traditionally binary, although the best representations are usually those that reflect something about the problem being solved),[2] by applying operators such as recombination and mutation (sometimes one, sometimes both). This type of EA is often used in optimization problems. Another name for it is fetura, from the Latin for breeding.[3]
  • Genetic programming – Here the solutions are in the form of computer programs, and their fitness is determined by their ability to solve a computational problem.
  • Evolutionary programming – Similar to genetic programming, but the structure of the program is fixed and its numerical parameters are allowed to evolve.
  • Gene expression programming – Like genetic programming, GEP also evolves computer programs but it explores a genotype-phenotype system, where computer programs of different sizes are encoded in linear chromosomes of fixed length.
  • Evolution strategy – Works with vectors of real numbers as representations of solutions, and typically uses self-adaptive mutation rates.
  • Differential evolution – Based on vector differences and is therefore primarily suited for numerical optimization problems (see the sketch after this list).
  • Neuroevolution – Similar to genetic programming but the genomes represent artificial neural networks by describing structure and connection weights. The genome encoding can be direct or indirect.
  • Learning classifier system – Here the solution is a set of classifiers (rules or conditions). A Michigan-LCS evolves at the level of individual classifiers whereas a Pittsburgh-LCS uses populations of classifier-sets. Initially, classifiers were only binary, but now include real, neural net, or S-expression types. Fitness is typically determined with either a strength or accuracy based reinforcement learning or supervised learning approach.
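
As a concrete instance of one variant from this list, here is a minimal sketch of differential evolution (the classic DE/rand/1/bin scheme) minimizing a toy numerical objective. The sphere function, the bounds, and the control parameters F and CR are illustrative assumptions.

    import random

    random.seed(2)
    DIM, NP, GENS, F, CR = 5, 20, 100, 0.8, 0.9

    def sphere(x):
        # Toy objective to minimize: sum of squares.
        return sum(v * v for v in x)

    # Initial population of real-valued vectors in [-5, 5]^DIM.
    pop = [[random.uniform(-5, 5) for _ in range(DIM)] for _ in range(NP)]

    for _ in range(GENS):
        for i in range(NP):
            # Mutation: add the scaled difference of two random vectors to a third.
            a, b, c = random.sample([p for j, p in enumerate(pop) if j != i], 3)
            mutant = [a[k] + F * (b[k] - c[k]) for k in range(DIM)]
            # Binomial crossover between the target and the mutant.
            j_rand = random.randrange(DIM)
            trial = [mutant[k] if (random.random() < CR or k == j_rand) else pop[i][k]
                     for k in range(DIM)]
            # Greedy selection: keep the trial if it is at least as good.
            if sphere(trial) <= sphere(pop[i]):
                pop[i] = trial

    print(min(sphere(p) for p in pop))  # should be close to 0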

Comparison to biological processes

A possible limitationModèle:According to whom of many evolutionary algorithms is their lack of a clear genotype-phenotype distinction. In nature, the fertilized egg cell undergoes a complex process known as embryogenesis to become a mature phenotype. This indirect encoding is believed to make the genetic search more robust (i.e. reduce the probability of fatal mutations), and also may improve the evolvability of the organism.[4][5] Such indirect (a.k.a. generative or developmental) encodings also enable evolution to exploit the regularity in the environment.[6] Recent work in the field of artificial embryogeny, or artificial developmental systems, seeks to address these concerns. And gene expression programming successfully explores a genotype-phenotype system, where the genotype consists of linear multigenic chromosomes of fixed length and the phenotype consists of multiple expression trees or computer programs of different sizes and shapes.[7] Modèle:Synthesis inline

Related techniques

Swarm algorithms include

  • Ant colony optimization – Based on the ideas of ant foraging by pheromone communication to form paths. Primarily suited for combinatorial optimization and graph problems.
  • The runner-root algorithm (RRA) is inspired by the function of runners and roots of plants in nature[8]
  • Artificial bee colony algorithm – Based on the honey bee foraging behaviour. Primarily proposed for numerical optimization and extended to solve combinatorial, constrained and multi-objective optimization problems.
  • Bees algorithm is based on the foraging behaviour of honey bees. It has been applied in many applications such as routing and scheduling.
  • Cuckoo search is inspired by the brood parasitism of the cuckoo species. It also uses Lévy flights, and is thus suited to global optimization problems.
  • Electimize optimization - Based on the behavior of electron flow through electric circuit branches with the least electric resistance.[9]
  • Particle swarm optimization – Based on the ideas of animal flocking behaviour. Also primarily suited for numerical optimization problems.

Other population-based metaheuristic methods

  • Hunting Search – A method inspired by the group hunting of some animals such as wolves that organize their position to surround the prey, each of them relative to the position of the others and especially that of their leader. It is a continuous optimization method[10] adapted as a combinatorial optimization method.[11]
  • Adaptive dimensional search – Unlike nature-inspired metaheuristic techniques, an adaptive dimensional search algorithm does not implement any metaphor as an underlying principle. Rather it uses a simple performance-oriented method, based on the update of the search dimensionality ratio (SDR) parameter at each iteration.[12]
  • Firefly algorithm is inspired by the behavior of fireflies, attracting each other by flashing light. This is especially useful for multimodal optimization.
  • Harmony search – Based on the ideas of musicians' behavior in searching for better harmonies. This algorithm is suitable for combinatorial optimization as well as parameter optimization.
  • Gaussian adaptation – Based on information theory. Used for maximization of manufacturing yield, mean fitness or average information. See for instance Entropy in thermodynamics and information theory.
  • Memetic algorithm – A hybrid method, inspired by Richard Dawkins's notion of a meme, it commonly takes the form of a population-based algorithm coupled with individual learning procedures capable of performing local refinements. Emphasizes the exploitation of problem-specific knowledge, and tries to orchestrate local and global search in a synergistic way.
  • Emperor Penguins Colony – A method inspired by the behavior of emperor penguins in their colony. The emperor penguins in the colony seek to create the appropriate heat and regulate their body temperature, and this heat is completely coordinated and controlled by the movement of the penguins.[13]

Examples

The computer simulations Tierra and Avida attempt to model macroevolutionary dynamics.


References

  1. Modèle:Cite article
  2. Modèle:Cite book
  3. Jon Roland, Wayward World. Novel that uses fetura to select candidates for public office.
  4. G.S. Hornby and J.B. Pollack. "Creating high-level components with a generative representation for body-brain evolution". Artificial Life, 8(3):223–246, 2002.
  5. Jeff Clune, Benjamin Beckmann, Charles Ofria, and Robert Pennock. "Evolving Coordinated Quadruped Gaits with the HyperNEAT Generative Encoding". Proceedings of the IEEE Congress on Evolutionary Computing Special Section on Evolutionary Robotics, 2009. Trondheim, Norway.
  6. J. Clune, C. Ofria, and R. T. Pennock, "How a generative encoding fares as problem-regularity decreases", in PPSN (G. Rudolph, T. Jansen, S. M. Lucas, C. Poloni, and N. Beume, eds.), vol. 5199 of Lecture Notes in Computer Science, pp. 358–367, Springer, 2008.
  7. Ferreira, C., 2001. "Gene Expression Programming: A New Adaptive Algorithm for Solving Problems". Complex Systems, Vol. 13, issue 2: 87–129.
  8. F. Merrikh-Bayat, "The runner-root algorithm: A metaheuristic for solving unimodal and multimodal optimization problems inspired by runners and roots of plants in nature", Applied Soft Computing, Vol. 33, pp. 292–303, 2015
  9. Modèle:Cite journal
  10. R. Oftadeh et al. (2010), "A novel meta-heuristic optimization algorithm inspired by group hunting of animals: Hunting search", 60, 2087–2098.
  11. Modèle:Cite journal
  12. Hasançebi, O., Kazemzadeh Azad, S. (2015), "Adaptive Dimensional Search: A New Metaheuristic Algorithm for Discrete Truss Sizing Optimization", Computers and Structures, 154, 1–16.
  13. Modèle:Cite journal
  14. Modèle:Cite journal
  15. Modèle:Cite journal
  16. Modèle:Cite book




Noogenesis is the emergence and evolution of intelligence.[1][2][3][4]

Term origin

Noo, or nous, from the ancient Greek, has synonyms in other languages (including Chinese) and is a term that currently encompasses the semantics of mind, intelligence, intellect, reason, wisdom, insight, intuition and thought in a single phenomenon.[5][6][7]

Noogenesis was first mentioned in The Phenomenon of Man, the book by anthropologist and philosopher Pierre Teilhard de Chardin published posthumously in 1955, in a few places: Modèle:Quote The lack of any definition of the term has led to a variety of interpretations reflected in the book,[8][9][10] including "the contemporary period of evolution on Earth, signified by the transformation of the biosphere into the sphere of intelligence - the noosphere",[11] "evolution run by the human mind",[12] etc. The most widespread interpretation is thought to be "the emergence of mind, which follows geogenesis, biogenesis and anthropogenesis, forming a new sphere on Earth - the noosphere".

Recent developments

Modern understanding

[Figure 1: Noogenesis: the evolution of the reaction speed.[13] In unicellular organisms, the rate of movement of ions through the membrane is ~10⁻¹⁰ m/s and of water through the membrane ~10⁻⁶ m/s, while intracellular liquid (cytoplasm) moves at ~2·10⁻⁵ m/s. Inside a multicellular organism, blood moves through the vessels at ~0.05 m/s and impulses travel along nerve fibers at ~100 m/s. In a population (humanity), communications travel by sound (voice and audio) at ~300 km/h and by quantum-electronic means at ~3·10⁸ m/s (the speed of radio-electromagnetic waves, electric current, light, optical and telecommunications).]

In 2005 Alexey Eryomin, in the monograph Noogenesis and Theory of Intellect,[14] proposed a new concept of noogenesis for understanding the evolution of intellectual systems,[15] along with concepts of intellectual systems, information logistics, information speed, intellectual energy and intellectual potential, consolidated into a theory of the intellect[16] that combines the biophysical parameters of intellectual energy (the amount of information, its acceleration (frequency, speed) and the distance over which it is sent) into a formula.[17] This concept hypothesizes a continued, progressive evolution of the species Homo sapiens,[18] drawing an analogy between the human brain, with its enormous number of neural cells firing at the same time, and a similarly functioning human society.[19]

[Figure 2: The iteration of the number of components in intellectual systems.[14] A - the number of neurons in the brain during individual development (ontogenesis); B - the number of people (evolution of the populations of humanity); C - the number of neurons in the nervous systems of organisms during evolution (phylogenesis).]
[Figure 3: The emergence and evolution of info-interactions within populations of humanity.[14] A - the world human population → 7 billion; B - the number of literate persons; C - the number of books read (since the beginning of printing); D - the number of receivers (radio, TV); E - the number of phones, computers, Internet users.]

A new understanding of the term "noogenesis" as the evolution of the intellect was proposed by A. Eryomin. A hypothesis based on recapitulation theory links the evolution of the human brain to the development of human civilization. The parallel between the number of people living on Earth and the number of neurons in the brain becomes more and more obvious, leading to a view of global intelligence as an analogy for the human brain. All of the people living on this planet have undoubtedly inherited the amazing cultural treasures of the past, be they productive, social or intellectual ones. We are genetically hardwired to be a sort of "live RAM" of the global intellectual system. Alexey Eryomin suggests that humanity is moving towards a unified, self-contained informational and intellectual system. His research has shown the probability of a Super Intellect realizing itself as Global Intelligence on Earth. We could get closer to understanding the most profound patterns and laws of the Universe if these kinds of research were given enough attention. Also, the resemblance between individual human development and that of the whole human race has to be explored further if we are to face some of the threats of the future.[20]


Interdisciplinary nature

The term "noogenesis" can be used in a variety of fields i.e. medicine,[21][22] biophysics,[23] semiotics,[24] mathematics,[25] information technology,[26] psychology, theory of global evolution[27] etc. thus making it a truly cross-disciplinary one. In astrobiology noogenesis concerns the origin of intelligent life and more specifically technological civilizations capable of communicating with humans and or traveling to Earth.[28] The lack of evidence for the existence of such extraterrestrial life creates the Fermi paradox.

Aspects of emergence and evolution of mind

To the parameters of the phenomenon "noo", "intellectus"

The emergence of the human mind is considered to be one of the five fundamental phenomena of emergent evolution.[29] To understand the mind, it is necessary to determine how human thinking differs from that of other thinking beings. Such differences include the ability to generate calculations, to combine dissimilar concepts, to use mental symbols, and to think abstractly.[30] Knowledge of the phenomenon of intelligent systems - the emergence of reason (noogenesis) - boils down to the following:

Several published works which do not employ the term "noogenesis" nevertheless address some patterns in the emergence and functioning of human intelligence: a working memory capacity ≥ 7,[31] the ability to predict and prognose,[32] the hierarchical (6-layer neuron) system of information analysis,[33] consciousness,[34] memory,[35] the properties of generated and consumed information,[36] etc. They also set the limits of several physiological aspects of human intelligence,[37] as well as the conception of the emergence of insight.[38]

Aspects of evolution "sapiens"

Historical evolutionary development[39] and the emergence of H. sapiens as a species[40] involve such concepts as anthropogenesis, phylogenesis, morphogenesis, cephalization,[41] systemogenesis,[42] and the autonomy of cognition systems.[43]

On the other hand, the development of an individual's intellect deals with the concepts of embryogenesis, ontogenesis,[44] morphogenesis, neurogenesis,[45] the higher nervous activity of I. P. Pavlov, and his philosophy of mind.[46] Although morphofunctional maturity is usually reached by the age of 13, the definitive functioning of the brain structures is not complete until about 16-17 years of age.[47]

The future of intelligence

The fields of bioinformatics, genetic engineering, noopharmacology, cognitive load management, brain stimulation, the efficient use of altered states of consciousness, the use of non-human cognition, information technology (IT), and artificial intelligence (AI) are all believed to be effective methods of intelligence advancement and may shape the future of intelligence on Earth and in the galaxy.[20][48]

Issues and further research prospects

The development of the human brain, perception, cognition, memory and neuroplasticity are unsolved problems in neuroscience. Several megaprojects, such as the American BRAIN Initiative, the European Human Brain Project, the China Brain Project, the Blue Brain Project, the Allen Brain Atlas, the Human Connectome Project, and Google Brain, are being carried out in an attempt to better our understanding of the brain's functionality, along with the intention of enhancing human cognitive performance in the future with artificial intelligence and informational, communication and cognitive technology.[49]


References

Modèle:Reflist







Neuroevolution, or neuro-evolution, is a form of artificial intelligence that uses evolutionary algorithms to generate artificial neural networks (ANNs), their parameters, topology and rules.[50] It is most commonly applied in artificial life, general game playing[51] and evolutionary robotics. Its main benefit is that neuroevolution can be applied more widely than supervised learning algorithms, which require a syllabus of correct input-output pairs. In contrast, neuroevolution requires only a measure of a network's performance at a task. For example, the outcome of a game (i.e. whether one player won or lost) can easily be measured without providing labeled examples of desired strategies. Neuroevolution can be contrasted with conventional deep learning techniques, which use gradient descent on a neural network with a fixed topology.

Features

Many neuroevolution algorithms have been defined. One common distinction is between algorithms that evolve only the strength of the connection weights for a fixed network topology (sometimes called conventional neuroevolution), as opposed to those that evolve both the topology of the network and its weights (called TWEANNs, for Topology and Weight Evolving Artificial Neural Network algorithms).

A separate distinction can be made between methods that evolve the structure of ANNs in parallel with their parameters (those applying standard evolutionary algorithms) and those that develop them separately (through memetic algorithms).[52] A minimal sketch of the conventional, fixed-topology case follows.
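
The sketch below illustrates conventional neuroevolution in Python, assuming a fixed 2-2-1 feedforward topology and a toy XOR task: only the weight vector (the genotype) evolves, and fitness is task performance alone, with no labeled gradients. The population size, mutation scale, and elite fraction are illustrative assumptions.

    import math, random

    random.seed(3)
    # Fixed topology: 2 inputs -> 2 hidden (tanh) -> 1 output; 9 weights incl. biases.
    N_W, POP, GENS, SIGMA = 9, 50, 200, 0.4
    XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

    def forward(w, x):
        h1 = math.tanh(w[0] * x[0] + w[1] * x[1] + w[2])
        h2 = math.tanh(w[3] * x[0] + w[4] * x[1] + w[5])
        return 1 / (1 + math.exp(-(w[6] * h1 + w[7] * h2 + w[8])))

    def fitness(w):
        # Only a performance measure is needed: negative squared error on the task.
        return -sum((forward(w, x) - y) ** 2 for x, y in XOR)

    pop = [[random.gauss(0, 1) for _ in range(N_W)] for _ in range(POP)]
    for _ in range(GENS):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: POP // 5]
        # Offspring are mutated copies of the elite; the topology never changes.
        pop = elite + [[g + random.gauss(0, SIGMA) for g in random.choice(elite)]
                       for _ in range(POP - len(elite))]

    best = max(pop, key=fitness)
    print([round(forward(best, x)) for x, _ in XOR])  # should approach [0, 1, 1, 0]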

Comparison with gradient descent

Most neural networks use gradient descent rather than neuroevolution. However, around 2017 researchers at Uber stated they had found that simple structural neuroevolution algorithms were competitive with sophisticated modern industry-standard gradient-descent deep learning algorithms, in part because neuroevolution was found to be less likely to get stuck in local minima. In Science, journalist Matthew Hutson speculated that part of the reason neuroevolution is succeeding where it had failed before is due to the increased computational power available in the 2010s.[53]

Direct and indirect encoding

Evolutionary algorithms operate on a population of genotypes (also referred to as genomes). In neuroevolution, a genotype is mapped to a neural network phenotype that is evaluated on some task to derive its fitness.

In direct encoding schemes the genotype directly maps to the phenotype: every neuron and connection in the neural network is specified directly and explicitly in the genotype. In contrast, in indirect encoding schemes the genotype specifies indirectly how the network should be generated[54] (see the sketch after the list below).

Indirect encodings are often used to achieve several aims:[54][55][56][57][58]

  • modularity and other regularities;
  • compression of phenotype to a smaller genotype, providing a smaller search space;
  • mapping the search space (genome) to the problem domain.
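
The contrast can be made concrete with a toy sketch: a direct encoding stores one gene per connection weight, while an indirect encoding stores a few rule parameters from which the full weight matrix is generated. The sinusoidal generator below is an arbitrary illustration of an indirect encoding, not a method from the literature.

    import math

    # Direct encoding: the genotype lists every weight of a 4x4 layer explicitly.
    direct_genotype = [0.1 * i for i in range(16)]          # 16 genes -> 16 weights
    direct_weights = [direct_genotype[4 * r : 4 * r + 4] for r in range(4)]

    # Indirect encoding: 3 genes parameterize a rule that *generates* all 16 weights,
    # giving a much smaller search space and built-in regularity.
    amp, freq, phase = 1.0, 0.5, 0.25                        # the whole genotype
    indirect_weights = [[amp * math.sin(freq * (r + c) + phase) for c in range(4)]
                        for r in range(4)]

    print(len(direct_genotype), "genes (direct) vs 3 genes (indirect)")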

Taxonomy of embryogenic systems for indirect encoding

Traditionally indirect encodings that employ artificial embryogeny (also known as artificial development) have been categorised along the lines of a grammatical approach versus a cell chemistry approach.[59] The former evolves sets of rules in the form of grammatical rewrite systems. The latter attempts to mimic how physical structures emerge in biology through gene expression. Indirect encoding systems often use aspects of both approaches.

Stanley and Miikkulainen[59] propose a taxonomy for embryogenic systems that is intended to reflect their underlying properties. The taxonomy identifies five continuous dimensions, along which any embryogenic system can be placed:

  • Cell (neuron) fate: the final characteristics and role of the cell in the mature phenotype. This dimension counts the number of methods used for determining the fate of a cell.
  • Targeting: the method by which connections are directed from source cells to target cells. This ranges from specific targeting (source and target are explicitly identified) to relative targeting (e.g. based on locations of cells relative to each other).
  • Heterochrony: the timing and ordering of events during embryogeny. Counts the number of mechanisms for changing the timing of events.
  • Canalization: how tolerant the genome is to mutations (brittleness). Ranges from requiring precise genotypic instructions to a high tolerance of imprecise mutation.
  • Complexification: the ability of the system (including evolutionary algorithm and genotype to phenotype mapping) to allow complexification of the genome (and hence phenotype) over time. Ranges from allowing only fixed-size genomes to allowing highly variable length genomes.

Examples

Examples of neuroevolution methods (those with direct encodings are necessarily non-embryogenic):

Each entry below gives the method, its encoding, the evolutionary algorithm used, and the aspects evolved.

  • Neuro-genetic evolution (E. Ronald, 1994).[60] Encoding: direct. Algorithm: genetic algorithm. Evolves: network weights.
  • Cellular Encoding (CE) (F. Gruau, 1994).[56] Encoding: indirect, embryogenic (grammar tree using S-expressions). Algorithm: genetic programming. Evolves: structure and parameters (simultaneous, complexification).
  • GNARL (Angeline et al., 1994).[61] Encoding: direct. Algorithm: evolutionary programming. Evolves: structure and parameters (simultaneous, complexification).
  • EPNet (Yao and Liu, 1997).[62] Encoding: direct. Algorithm: evolutionary programming (combined with backpropagation and simulated annealing). Evolves: structure and parameters (mixed, complexification and simplification).
  • NeuroEvolution of Augmenting Topologies (NEAT) (Stanley and Miikkulainen, 2002).[32][63] Encoding: direct. Algorithm: genetic algorithm; tracks genes with historical markings to allow crossover between different topologies and protects innovation via speciation. Evolves: structure and parameters.
  • Hypercube-based NeuroEvolution of Augmenting Topologies (HyperNEAT) (Stanley, D'Ambrosio, Gauci, 2008).[55] Encoding: indirect, non-embryogenic (spatial patterns generated by a compositional pattern-producing network (CPPN) within a hypercube are interpreted as connectivity patterns in a lower-dimensional space). Algorithm: genetic algorithm; the NEAT algorithm (above) is used to evolve the CPPN. Evolves: parameters; structure fixed (functionally fully connected).
  • Evolvable Substrate HyperNEAT (ES-HyperNEAT) (Risi, Stanley, 2012).[58] Encoding: indirect, non-embryogenic (as in HyperNEAT). Algorithm: genetic algorithm; the NEAT algorithm is used to evolve the CPPN. Evolves: parameters and network structure.
  • Evolutionary Acquisition of Neural Topologies (EANT/EANT2) (Kassahun and Sommer, 2005;[64] Siebel and Sommer, 2007[65]). Encoding: direct and indirect, potentially embryogenic (Common Genetic Encoding[54]). Algorithm: evolutionary programming / evolution strategies. Evolves: structure and parameters (separately, complexification).
  • Interactively Constrained Neuro-Evolution (ICONE) (Rempis, 2012).[66] Encoding: direct; includes constraint masks to restrict the search to specific topology/parameter manifolds. Algorithm: evolutionary algorithm; uses constraint masks to drastically reduce the search space by exploiting domain knowledge. Evolves: structure and parameters (separately, complexification, interactive).
  • Deus Ex Neural Network (DXNN) (Gene Sher, 2012).[67] Encoding: direct/indirect; includes constraints and local tuning, and allows evolution to integrate new sensors and actuators. Algorithm: memetic algorithm; evolves network structure and parameters on different time scales. Evolves: structure and parameters (separately, complexification, interactive).
  • Spectrum-diverse Unified Neuroevolution Architecture (SUNA) (Danilo Vasconcellos Vargas, Junichi Murata)[68] (Download code). Encoding: direct; introduces the Unified Neural Representation, integrating most of the neural network features from the literature. Algorithm: genetic algorithm with a diversity-preserving mechanism called spectrum diversity that scales well with chromosome size, is problem-independent, and focuses on obtaining diversity of high-level behaviours/approaches; to achieve this, the concept of chromosome spectrum is introduced and used together with a novelty map population. Evolves: structure and parameters (mixed, complexification and simplification).
  • Modular Agent-Based Evolver (MABE) (Clifford Bohm, Arend Hintze, and others)[69] (Download code). Encoding: direct or indirect encoding of Markov networks, neural networks, genetic programming, and other arbitrarily customizable controllers. Algorithm: provides evolutionary and genetic programming algorithms, allows customized algorithms, and supports specification of arbitrary constraints. Evolves: the neural model, as well as morphology and sexual selection, among others.
  • Covariance Matrix Adaptation with Hypervolume Sorted Adaptive Grid Algorithm (CMA-HAGA) (Shahin Rostami and others).[70][71] Encoding: direct; includes an atavism feature which enables traits to disappear and reappear at different generations. Algorithm: multi-objective evolution strategy with preference articulation. Evolves: structure, weights, and biases.


References

Modèle:Reflist

External links

  • Modèle:Cite web
  • Modèle:Cite web (has downloadable papers on NEAT and applications)
  • Modèle:Cite web - a mature open-source neuroevolution project implemented in C#/.NET.
  • ANNEvolve, an open-source AI research project (downloadable source code in C and Python, with a tutorial and miscellaneous writings and illustrations).
  • Modèle:Cite web - web page on evolutionary learning with EANT/EANT2 (information and articles on EANT/EANT2 with applications to robot learning).
  • NERD Toolkit, the Neurodynamics and Evolutionary Robotics Development Toolkit. A free, open-source software collection for various experiments on neurocontrol and neuroevolution. Includes a scriptable simulator, several neuroevolution algorithms (e.g. ICONE), cluster support, and visual network design and analysis tools.
  • Modèle:Cite web - source code for the DXNN neuroevolution system.
  • Modèle:Cite web



