Wide neural networks and neural tangent kernels

A solid theoretical understanding of deep neural networks is lacking at the moment. This is because, although the building blocks of neural networks are simple, their collective behavior is very complex. Mathematically, this complexity is captured by a non-linear, time-evolving object called the neural tangent kernel, which enters into the training dynamics.
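In formulas (standard notation from the NTK literature), for a network function $f_\theta(x)$ with parameters $\theta = (\theta_1, \dots, \theta_P)$, the neural tangent kernel at training time $t$ is the inner product of parameter gradients,

\[
\Theta_t(x, x') = \sum_{p=1}^{P} \frac{\partial f_{\theta(t)}(x)}{\partial \theta_p}\,\frac{\partial f_{\theta(t)}(x')}{\partial \theta_p}\,.
\]

At finite width, this kernel depends on the random initialization and changes as the parameters $\theta(t)$ evolve during training.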

The number of neurons in each layer of a neural network is called the “width” of that layer. In the limit in which the widths of all hidden layers go to infinity, the neural network simplifies dramatically. By an argument based on the central limit theorem, one can show that in the infinite width limit, the neurons collectively follow a Gaussian distribution known as a Gaussian process; intuitively, the fluctuations of the individual neurons average out. This effect has been known for a long time in simple cases and characterizes the neural network at initialization, i.e. before training has begun. Furthermore, the neural tangent kernel becomes independent of the specific initialization chosen and can be computed using recursive layer-by-layer expressions.
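As a sketch of these recursions (in common NTK conventions; $\sigma_w^2$ and $\sigma_b^2$ denote the weight and bias variances, $\sigma$ the activation function, $\dot\sigma$ its derivative and $n_0$ the input dimension), the Gaussian-process kernel $\Sigma$ and the neural tangent kernel $\Theta$ of a fully-connected network are built up layer by layer:

\[
\Sigma^{(1)}(x, x') = \frac{\sigma_w^2}{n_0}\, x^\top x' + \sigma_b^2\,, \qquad
\Sigma^{(\ell+1)}(x, x') = \sigma_w^2\, \mathbb{E}_{f \sim \mathcal{N}(0,\, \Sigma^{(\ell)})}\!\big[\sigma(f(x))\,\sigma(f(x'))\big] + \sigma_b^2\,,
\]
\[
\Theta^{(1)}(x, x') = \Sigma^{(1)}(x, x')\,, \qquad
\Theta^{(\ell+1)}(x, x') = \Sigma^{(\ell+1)}(x, x') + \sigma_w^2\, \Theta^{(\ell)}(x, x')\, \mathbb{E}_{f \sim \mathcal{N}(0,\, \Sigma^{(\ell)})}\!\big[\dot\sigma(f(x))\,\dot\sigma(f(x'))\big]\,.
\]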

In 2018, a seminal paper by Jacot et al. showed that the simplifications go even further: in the infinite width limit, the dynamics remain manageable even during training. Specifically, they proved that the neural tangent kernel of an infinitely wide network remains constant throughout training. Combined with the earlier results, this means that the neural tangent kernel is analytically accessible at any point during training.
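Concretely, under continuous-time gradient descent (gradient flow) on a training set $X$ with loss $\mathcal{L}$, the network outputs evolve according to

\[
\frac{\mathrm{d} f_t(x)}{\mathrm{d} t} = -\eta\, \Theta_t(x, X)\, \nabla_{f_t(X)} \mathcal{L}\,,
\]

where $\eta$ is the learning rate. When $\Theta_t$ is frozen at its infinite-width value, the entire dependence on the network architecture enters through this one fixed kernel.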

Under the simple but realistic training paradigm of gradient descent on the mean-squared-error loss, the training dynamics can in fact be solved analytically in closed form, and the prediction of the trained network on any input can be computed. The output of the trained network is again a Gaussian process.
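As a sketch, for the mean-squared-error loss $\mathcal{L} = \tfrac{1}{2}\lVert f_t(X) - Y\rVert^2$ the evolution equation above becomes linear and can be integrated explicitly. For the mean prediction over initializations, the result at training time $t$ reads

\[
f_t(x) = \Theta(x, X)\, \Theta(X, X)^{-1} \big( I - e^{-\eta\, \Theta(X, X)\, t} \big)\, Y\,,
\]

which for $t \to \infty$ reduces to kernel regression with the neural tangent kernel, $f_\infty(x) = \Theta(x, X)\, \Theta(X, X)^{-1}\, Y$.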

These simplifications in the infinite width limit give powerful insights into the behavior of neural networks. At the same time, the neural tangent kernel can be used as the kernel of kernel machines known from traditional machine learning.
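As an illustration of this use, the following minimal sketch (not code from the publications below; the network, data and hyperparameters are illustrative assumptions) computes the empirical neural tangent kernel of a small MLP in JAX at initialization and uses it as the kernel of a kernel-regression predictor:

```python
import jax
import jax.numpy as jnp

def init_params(key, sizes=(2, 512, 1)):
    """Random MLP weights with variance 1/fan-in (illustrative toy network)."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append(jax.random.normal(sub, (n_in, n_out)) / jnp.sqrt(n_in))
    return params

def mlp(params, x):
    """ReLU MLP with a single scalar output per input."""
    h = x
    for w in params[:-1]:
        h = jax.nn.relu(h @ w)
    return (h @ params[-1]).squeeze(-1)

def empirical_ntk(params, x1, x2):
    """Theta(x1, x2)[i, j] = sum_p df(x1_i)/dtheta_p * df(x2_j)/dtheta_p."""
    j1 = jax.jacobian(mlp)(params, x1)  # Jacobians w.r.t. all parameters
    j2 = jax.jacobian(mlp)(params, x2)
    contract = lambda a, b: jnp.tensordot(
        a, b, axes=(list(range(1, a.ndim)), list(range(1, b.ndim))))
    return sum(contract(a, b) for a, b in zip(j1, j2))

key = jax.random.PRNGKey(0)
params = init_params(key)
x_train = jax.random.normal(jax.random.PRNGKey(1), (20, 2))
y_train = jnp.sin(x_train[:, 0])                    # toy regression targets
x_test = jax.random.normal(jax.random.PRNGKey(2), (5, 2))

# Kernel regression with the NTK: f(x) = Theta(x, X) Theta(X, X)^{-1} Y
k_train = empirical_ntk(params, x_train, x_train)
k_test = empirical_ntk(params, x_test, x_train)
prediction = k_test @ jnp.linalg.solve(
    k_train + 1e-6 * jnp.eye(len(x_train)), y_train)  # jitter for stability
```

As the width grows, the empirical kernel computed here approaches the deterministic infinite-width kernel given by the layer-by-layer recursion above.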

However, realistic neural networks are described by these computations only approximately. To make statements about networks at finite width, one can consider deviations from the strict infinite width limit. Mathematically, the behavior of wide neural networks is similar to the behavior of elementary particles as described by quantum field theory. In this context, the infinite width limit corresponds to particles which do not interact and deviations are introduced by letting the particles interact with each other.

Therefore, by using advanced techniques from quantum field theory, one can analyze the behavior of neural networks. In particular, it is possible to extrapolate from infinitely wide networks and make approximate statements about networks with finitely many neurons, as encountered in practice.
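Schematically, with $n$ denoting the (common) hidden-layer width, the finite-width corrections organize themselves into a perturbative series,

\[
\Theta = \Theta^{(\infty)} + \frac{1}{n}\, \Theta^{(1)} + \mathcal{O}\!\left(\frac{1}{n^{2}}\right),
\]

in close analogy to the expansion in a small coupling constant in quantum field theory, where the leading term describes free, non-interacting particles.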

In this research program, we use techniques from theoretical physics to analyze the behavior of wide neural networks and to understand and improve their architecture, training and initialization.

Relevant publications

Equivariant Neural Tangent Kernels
2024
Philipp Misof, Pan Kessel, Jan E. Gerken

Equivariant neural networks have in recent years become an important technique for guiding architecture selection for neural networks with many applications in domains ranging from medical image analysis to quantum chemistry. In particular, as the most general linear equivariant layers with respect to the regular representation, group convolutions have been highly impactful in numerous applications. Although equivariant architectures have been studied extensively, much less is known about the training dynamics of equivariant neural networks. Concurrently, neural tangent kernels (NTKs) have emerged as a powerful tool to analytically understand the training dynamics of wide neural networks. In this work, we combine these two fields for the first time by giving explicit expressions for NTKs of group convolutional neural networks. In numerical experiments, we demonstrate superior performance for equivariant NTKs over non-equivariant NTKs on a classification task for medical images.

Preprint: arXiv
NTK ENN
Emergent Equivariance in Deep Ensembles
2024
Jan E. Gerken, Pan Kessel

We demonstrate that deep ensembles are secretly equivariant models. More precisely, we show that deep ensembles become equivariant for all inputs and at all training times by simply using data augmentation. Crucially, equivariance holds off-manifold and for any architecture in the infinite width limit. The equivariance is emergent in the sense that predictions of individual ensemble members are not equivariant but their collective prediction is. Neural tangent kernel theory is used to derive this result and we verify our theoretical insights using detailed numerical experiments.

Published: ICML 2024 (Oral)
Preprint: arXiv
NTK ENN