Residual neural network

Canonical form of a residual neural network. A layer $\ell - 1$ is skipped over activation from layer $\ell - 2$.

A residual neural network (ResNet) is an artificial neural network (ANN) that builds on constructs known from pyramidal cells in the cerebral cortex. Residual neural networks do this by utilizing skip connections, or shortcuts, to jump over some layers. Typical ResNet models are implemented with double- or triple-layer skips that contain nonlinearities (ReLU) and batch normalization in between.[1] An additional weight matrix may be used to learn the skip weights; these models are known as HighwayNets.[2] Models with several parallel skips are referred to as DenseNets.[3] In the context of residual neural networks, a non-residual network may be described as a plain network.
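
For illustration, the block below is a minimal sketch of such a double-layer skip written in PyTorch; the framework, channel count, and kernel size are assumptions made for the example rather than details taken from the cited papers. The input bypasses two convolution and batch-normalization layers and is added back before the final ReLU.

import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Double-layer skip: conv -> BN -> ReLU -> conv -> BN, plus an identity shortcut."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x                              # the skip connection
        out = self.relu(self.bn1(self.conv1(x)))  # first skipped layer
        out = self.bn2(self.conv2(out))           # second skipped layer
        return self.relu(out + identity)          # add the shortcut back, then apply ReLU

block = ResidualBlock(channels=64)
y = block(torch.randn(1, 64, 32, 32))             # output shape matches the input: (1, 64, 32, 32)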

A reconstruction of a pyramidal cell. Soma and dendrites are labeled in red, axon arbor in blue. (1) Soma, (2) Basal dendrite, (3) Apical dendrite, (4) Axon, (5) Collateral axon.

There are two main reasons to add skip connections: to avoid the problem of vanishing gradients, or to mitigate the degradation (accuracy saturation) problem, where adding more layers to a suitably deep model leads to higher training error.[1] During training, the weights adapt to mute the upstream layer and amplify the previously skipped layer. In the simplest case, only the weights for the adjacent layer's connection are adapted, with no explicit weights for the upstream layer. This works best when a single nonlinear layer is stepped over, or when the intermediate layers are all linear. If not, then an explicit weight matrix should be learned for the skipped connection (that is, a HighwayNet should be used).

Skipping effectively simplifies the network, using fewer layers in the initial training stages. This speeds learning by reducing the impact of vanishing gradients, as there are fewer layers to propagate through. The network then gradually restores the skipped layers as it learns the feature space. Towards the end of training, when all layers are expanded, it stays closer to the manifold and thus learns faster. A neural network without residual parts explores more of the feature space, which makes it more vulnerable to perturbations that cause it to leave the manifold and necessitates extra training data to recover.

Biological analogue

The brain has structures similar to residual nets, as cortical layer VI neurons receive input from layer I, skipping intermediary layers.[4] In the figure this compares to signals from the apical dendrite (3) skipping over layers, while the basal dendrite (2) collects signals from the previous and/or the same layer.[note 1][5] Similar structures exist for other layers.[6] It is not clear how many layers in the cerebral cortex compare to layers in an artificial neural network, nor whether every area in the cerebral cortex exhibits the same structure, but over large areas they appear similar. There is no evidence of anything like backpropagation taking place in the brain, and the existence of a global "teaching signal" or of iterative optimization in animal brains is not supported by neurophysiological evidence.

Forward propagation

Given a weight matrix $W^{\ell-1,\ell}$ for connection weights from layer $\ell-1$ to $\ell$, and a weight matrix $W^{\ell-2,\ell}$ for connection weights from layer $\ell-2$ to $\ell$, then the forward propagation through the activation function would be (aka HighwayNets)

$$a^{\ell} := g\left(W^{\ell-1,\ell} \cdot a^{\ell-1} + b^{\ell} + W^{\ell-2,\ell} \cdot a^{\ell-2}\right)$$

where

$a^{\ell}$: the activations (outputs) of neurons in layer $\ell$,
$g$: the activation function for layer $\ell$,
$W^{\ell-1,\ell}$: the weight matrix for neurons between layer $\ell-1$ and layer $\ell$, and
$b^{\ell}$: the bias term for layer $\ell$.

Absent an explicit matrix $W^{\ell-2,\ell}$ (aka ResNets), forward propagation through the activation function simplifies to

$$a^{\ell} := g\left(W^{\ell-1,\ell} \cdot a^{\ell-1} + b^{\ell} + a^{\ell-2}\right)$$

Another way to formulate this is to substitute an identity matrix for $W^{\ell-2,\ell}$, but that is only valid when the dimensions match. This is somewhat confusingly called an identity block, which means that the activations from layer $\ell-2$ are passed to layer $\ell$ without weighting.
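
As a numerical illustration of the HighwayNet and ResNet forms above, here is a minimal NumPy sketch; the layer width, random values, and the choice of ReLU for $g$ are assumptions made for the example.

import numpy as np

def g(x):
    return np.maximum(x, 0.0)            # ReLU, standing in for the activation function g

rng = np.random.default_rng(0)
n = 8                                    # assume every layer has n neurons, so dimensions match
a_lm1 = rng.standard_normal(n)           # a^(l-1)
a_lm2 = rng.standard_normal(n)           # a^(l-2)
W_lm1 = rng.standard_normal((n, n))      # W^(l-1,l)
W_lm2 = rng.standard_normal((n, n))      # W^(l-2,l), explicit skip weights
b = np.zeros(n)                          # b^l

a_highway = g(W_lm1 @ a_lm1 + b + W_lm2 @ a_lm2)  # HighwayNet form: weighted skip connection
a_resnet  = g(W_lm1 @ a_lm1 + b + a_lm2)          # ResNet form: identity skip, no extra weights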

In the cerebral cortex such forward skips are done for several layers. Usually all forward skips start from the same layer, and successively connect to later layers. In the general case this will be expressed as (aka DenseNets)

$$a^{\ell} := g\left(W^{\ell-1,\ell} \cdot a^{\ell-1} + b^{\ell} + \sum_{k=2}^{K} W^{\ell-k,\ell} \cdot a^{\ell-k}\right).$$
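
The general rule above can be sketched the same way; the helper function below is a hypothetical illustration, with shapes and the ReLU activation again assumed, that adds one weighted term per skipped layer.

import numpy as np

def forward_with_skips(W_prev, a_prev, b, skips, g=lambda x: np.maximum(x, 0.0)):
    """skips: list of (W_skip, a_skip) pairs, one per skipped layer l-k, for k = 2..K."""
    pre_activation = W_prev @ a_prev + b
    for W_k, a_k in skips:
        pre_activation = pre_activation + W_k @ a_k   # add each weighted skip term
    return g(pre_activation)

rng = np.random.default_rng(1)
n = 8
skips = [(rng.standard_normal((n, n)), rng.standard_normal(n)) for _ in range(3)]  # K = 4 gives 3 skip terms (k = 2, 3, 4)
a_l = forward_with_skips(rng.standard_normal((n, n)), rng.standard_normal(n), np.zeros(n), skips)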

Backward propagation

During backpropagation learning for the normal path

$$\Delta w^{\ell-1,\ell} := -\eta \frac{\partial E^{\ell}}{\partial w^{\ell-1,\ell}} = -\eta\, a^{\ell-1} \cdot \delta^{\ell}$$

and for the skip paths (nearly identical)

$$\Delta w^{\ell-2,\ell} := -\eta \frac{\partial E^{\ell}}{\partial w^{\ell-2,\ell}} = -\eta\, a^{\ell-2} \cdot \delta^{\ell}.$$

In both cases

$\eta$: a learning rate ($\eta > 0$),
$\delta^{\ell}$: the error signal of neurons at layer $\ell$, and
$a^{\ell}$: the activation of neurons at layer $\ell$.

If the skip path has fixed weights (e.g. the identity matrix, as above), then they are not updated. If they can be updated, the rule is an ordinary backpropagation update rule.

In the general case there can be $K$ skip path weight matrices, thus

$$\Delta w^{\ell-k,\ell} := -\eta \frac{\partial E^{\ell}}{\partial w^{\ell-k,\ell}} = -\eta\, a^{\ell-k} \cdot \delta^{\ell}.$$

As the learning rules are similar, the weight matrices can be merged and learned in the same step.
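
The update rules above amount to the same outer-product gradient step on every path. Below is a minimal NumPy sketch, with an assumed layer width and a random error signal standing in for one obtained by backpropagating a loss.

import numpy as np

rng = np.random.default_rng(2)
n = 8
eta = 0.01                                 # learning rate, eta > 0
delta_l = rng.standard_normal(n)           # delta^l: error signal at layer l (random here, purely for illustration)
a_lm1 = rng.standard_normal(n)             # a^(l-1)
a_lm2 = rng.standard_normal(n)             # a^(l-2)

# With the convention a^l = g(W @ a^(l-1) + ...), the gradient of the error with respect to
# each weight matrix is the outer product of delta^l and the corresponding incoming activation.
dW_normal = -eta * np.outer(delta_l, a_lm1)   # update for W^(l-1,l), the normal path
dW_skip   = -eta * np.outer(delta_l, a_lm2)   # update for W^(l-2,l), if the skip weights are learnable

# A fixed identity skip (a ResNet identity block) is simply left out of the update;
# learnable skip matrices can be updated together with the normal path in the same step.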

Notes

  1. ^ Some research indicates that there are additional structures here, so this explanation is somewhat simplified.

References

  1. ^ a b He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE. pp. 770–778. arXiv:1512.03385. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1.
  2. ^ Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (2015-05-02). "Highway Networks". arXiv:1505.00387 [cs.LG].
  3. ^ Huang, Gao; Liu, Zhuang; Van Der Maaten, Laurens; Weinberger, Kilian Q. (2017). Densely Connected Convolutional Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI: IEEE. pp. 2261–2269. arXiv:1608.06993. doi:10.1109/CVPR.2017.243. ISBN 978-1-5386-0457-1.
  4. ^ Thomson, AM (2010). "Neocortical layer 6, a review". Frontiers in Neuroanatomy. 4: 13. doi:10.3389/fnana.2010.00013. PMC 2885865. PMID 20556241.
  5. ^ Winterer, Jochen; Maier, Nikolaus; Wozny, Christian; Beed, Prateep; Breustedt, Jörg; Evangelista, Roberta; Peng, Yangfan; D’Albis, Tiziano; Kempter, Richard (2017). "Excitatory Microcircuits within Superficial Layers of the Medial Entorhinal Cortex". Cell Reports. 19 (6): 1110–1116. doi:10.1016/j.celrep.2017.04.041. PMID 28494861.
  6. ^ Fitzpatrick, David (1996-05-01). "The Functional Organization of Local Circuits in Visual Cortex: Insights from the Study of Tree Shrew Striate Cortex". Cerebral Cortex. 6 (3): 329–341. doi:10.1093/cercor/6.3.329. ISSN 1047-3211. PMID 8670661.