Spherical computer vision and equivariant transformers

Transformer networks are a type of neural network architecture that has been widely used in natural language processing (NLP) tasks such as machine translation and language modeling. They were introduced in the paper "Attention Is All You Need" by Vaswani et al. One key feature of transformer networks is their use of self-attention mechanisms, which allow the network to selectively weight the importance of different input elements when processing them. This allows transformer networks to capture long-range dependencies and relationships within a sequence of data, such as a sentence or paragraph of text. Another notable feature of transformer networks is their ability to process all elements of an input sequence in parallel, in contrast to recurrent architectures. This allows them to achieve high computational efficiency and fast training times, making them particularly well-suited for large-scale NLP tasks. Overall, transformer networks have achieved state-of-the-art results on a wide range of NLP tasks and have become a key tool for researchers and practitioners in the field. In particular, transformers form the foundation for the algorithm behind OpenAI's immensely popular chatbot ChatGPT.
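
To make the self-attention mechanism concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention, the core operation of a transformer layer. The variable names and dimensions are illustrative choices of ours, not taken from any particular implementation.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (minimal sketch).

    X: (seq_len, d_model) input embeddings
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Scores: how strongly each position attends to every other position.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax turns the scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output is a weighted sum of value vectors

# Toy usage: a "sentence" of 5 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)
```

In a full transformer, several such heads run in parallel (multi-head attention) and their outputs are combined and passed through feed-forward layers.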

Though very successful for language-related tasks, transformers are not ideal for dealing with images due to their linear sequence structure. However, in 2020 the "vision transformer" (ViT) was introduced. It was designed to adapt the transformer architecture to computer vision applications, rather than language processing. The main idea is to split an image into patches, from which lower-dimensional embeddings are produced. This results in a sequence that can be fed into a standard transformer encoder. Despite its promise, the ViT suffers from several shortcomings. In particular, it struggles with high-resolution images, since its computational complexity is quadratic in the number of patches and therefore grows rapidly with image resolution. The "Swin transformer" resolves these issues, representing an improved variant of the ViT with superior performance. One key feature of Swin transformers is their use of local self-attention mechanisms, which allow the network to selectively weight the importance of different spatial regions within an image when processing it. This enables the network to capture fine-grained details and spatial relationships within the image, which is important for tasks such as image recognition. Overall, Swin transformers offer a promising approach for image recognition tasks and have the potential to find wide-ranging applications in fields such as computer vision and robotics.
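
As a rough sketch of the ViT input pipeline described above, the snippet below splits an image into non-overlapping patches and projects each one to an embedding vector. The patch size and embedding dimension are hypothetical choices, not those of any specific model.

```python
import numpy as np

def patchify(image, patch_size, W_embed):
    """Split an image into patches and embed each one (minimal sketch).

    image: (H, W, C) array with H and W divisible by patch_size
    W_embed: (patch_size*patch_size*C, d_model) learned projection
    """
    H, W, C = image.shape
    p = patch_size
    # Carve the image into a (H//p, W//p) grid of p x p patches.
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    # Flatten each patch into a vector, yielding a sequence of tokens.
    tokens = patches.reshape(-1, p * p * C)
    return tokens @ W_embed  # (num_patches, d_model)

rng = np.random.default_rng(0)
img = rng.normal(size=(224, 224, 3))
W_embed = rng.normal(size=(16 * 16 * 3, 128))
print(patchify(img, 16, W_embed).shape)  # (196, 128): a 14x14 patch grid
```

The resulting sequence of patch embeddings (plus positional information) is what the standard transformer encoder consumes.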

As we have discussed above, for image recognition, equivariance implies that you can recognize an object in an image even if its appearance or position changes. A standard transformer, on the other hand, is permutation equivariant by construction, but it has no built-in notion of the grid structure of an image. In order to implement some more general group equivariance, it seems natural to take the Swin transformer as a starting point. The Swin transformer divides an image into a set of "windows", each of which contains a fixed number of patches. One of its key advantages is the use of window-based self-attention, in which the attention of each patch is computed only with respect to the other patches in the same window. This is in contrast to the ViT, which computes the self-attention of every patch with respect to every other patch. Since the window size is fixed throughout the network, this results in linear computational complexity with respect to the number of patches, a substantial improvement over the ViT. The name "Swin" comes from "shifted windows", the mechanism that introduces cross-window connections: in alternating layers, the window partition is shifted by half the window size toward the bottom right corner of the image. This has been found to significantly improve the performance of the network.
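
The windowing and shifting operations can be sketched in a few lines of NumPy. The grid size, window size, and function names below are our own illustrative choices, not the reference Swin implementation (which realizes the shift as a cyclic roll of the feature map).

```python
import numpy as np

def window_partition(tokens, window_size):
    """Group an (H, W, C) grid of patch tokens into non-overlapping
    windows; self-attention is then computed within each window."""
    H, W, C = tokens.shape
    w = window_size
    x = tokens.reshape(H // w, w, W // w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, w * w, C)  # (num_windows, tokens_per_window, C)

def shift_windows(tokens, window_size):
    """Cyclically shift the grid by half a window, so that the next
    layer's windows straddle the previous layer's window boundaries."""
    s = window_size // 2
    return np.roll(tokens, shift=(-s, -s), axis=(0, 1))

rng = np.random.default_rng(0)
grid = rng.normal(size=(8, 8, 32))  # an 8x8 grid of 32-channel tokens
wins = window_partition(grid, 4)                        # (4, 16, 32)
shifted = window_partition(shift_windows(grid, 4), 4)   # cross-window mixing
print(wins.shape, shifted.shape)
```

Alternating between the regular and the shifted partition lets information propagate across window boundaries while keeping every attention computation local and cheap.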

On the theoretical side of this project, we explore and develop the mathematical foundations of "equivariant transformers". We are currently focusing on developing an equivariant version of the Swin transformer. There are strong analogies between the Swin transformer and the differential geometric structure of manifolds. As mentioned above, the Swin transformer treats images as grids of patches organized into windows that can shift across the image. This structure therefore appears very amenable to generalizing the Swin transformer to manifolds, to wit, a "gauge equivariant Swin transformer". This can be viewed as a long-term theoretical goal of this research project.

Relevant publications

HEAL-SWIN: A Vision Transformer On The Sphere
2023
Oscar Carlsson, Jan E. Gerken, Hampus Linander, Heiner Spieß, Fredrik Ohlsson, Christoffer Petersson, Daniel Persson

High-resolution wide-angle fisheye images are becoming more and more important for robotics applications such as autonomous driving. However, using ordinary convolutional neural networks or vision transformers on this data is problematic due to projection and distortion losses introduced when projecting to a rectangular grid on the plane. We introduce the HEAL-SWIN transformer, which combines the highly uniform Hierarchical Equal Area iso-Latitude Pixelation (HEALPix) grid used in astrophysics and cosmology with the Hierarchical Shifted-Window (SWIN) transformer to yield an efficient and flexible model capable of training on high-resolution, distortion-free spherical data. In HEAL-SWIN, the nested structure of the HEALPix grid is used to perform the patching and windowing operations of the SWIN transformer, resulting in a one-dimensional representation of the spherical data with minimal computational overhead. We demonstrate the superior performance of our model for semantic segmentation and depth regression tasks on both synthetic and real automotive datasets. Our code is available at https://github.com/JanEGerken/HEAL-SWIN.
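
To illustrate the key structural idea (our sketch, not code from the HEAL-SWIN repository): in the nested HEALPix ordering, every block of 4^k consecutive pixel indices covers a contiguous region of the sphere, so patching and windowing reduce to reshaping a one-dimensional array. The resolution and window size below are illustrative, not the actual HEAL-SWIN configuration.

```python
import numpy as np
import healpy as hp  # pip install healpy

nside = 8                        # HEALPix resolution parameter (power of 2)
npix = hp.nside2npix(nside)      # 12 * nside**2 = 768 pixels
spherical_img = np.arange(npix, dtype=np.float32)  # stand-in signal, NESTED order

# In nested ordering, each run of 4**k consecutive indices is a contiguous
# patch of the sphere, so "windowing" the 1D array is a plain reshape --
# no projection to a rectangular grid and hence no distortion.
window = 4**2                    # 16 pixels per window (illustrative)
windows = spherical_img.reshape(-1, window)
print(windows.shape)             # (48, 16)
```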

Published: CVPR 2024
Preprint: arXiv
Equivariance versus Augmentation for Spherical Images
2022
Jan E. Gerken, Oscar Carlsson, Hampus Linander, Fredrik Ohlsson, Christoffer Petersson, Daniel Persson

We analyze the role of rotational equivariance in convolutional neural networks (CNNs) applied to spherical images. We compare the performance of the group equivariant networks known as S2CNNs and standard non-equivariant CNNs trained with an increasing amount of data augmentation. The chosen architectures can be considered baseline references for the respective design paradigms. Our models are trained and evaluated on single or multiple items from the MNIST- or FashionMNIST dataset projected onto the sphere. For the task of image classification, which is inherently rotationally invariant, we find that by considerably increasing the amount of data augmentation and the size of the networks, it is possible for the standard CNNs to reach at least the same performance as the equivariant network. In contrast, for the inherently equivariant task of semantic segmentation, the non-equivariant networks are consistently outperformed by the equivariant networks with significantly fewer parameters. We also analyze and compare the inference latency and training times of the different networks, enabling detailed tradeoff considerations between equivariant architectures and data augmentation for practical problems.
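
As an illustration of the data-augmentation baseline (our sketch using healpy; the paper's spherical grid and its actual augmentation pipeline may differ), one can rotate a spherical signal by a random rotation before training:

```python
import numpy as np
import healpy as hp  # pip install healpy

def random_rotation_augment(sphere_map, rng):
    """Rotate a spherical image by a random rotation -- a sketch of the
    rotational data augmentation compared against equivariant networks."""
    # Random Euler angles in degrees; not exactly Haar-uniform on SO(3),
    # but sufficient to illustrate the idea.
    angles = rng.uniform(0.0, 360.0, size=3)
    rot = hp.Rotator(rot=angles, deg=True)
    return rot.rotate_map_pixel(sphere_map)

rng = np.random.default_rng(0)
digit_on_sphere = rng.normal(size=hp.nside2npix(16))  # stand-in for spherical MNIST
augmented = random_rotation_augment(digit_on_sphere, rng)
```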

Published: ICML 2022
Preprint: arXiv