Speaker: Doroteo Torre Toledano.
Abstract: Since Transformers were proposed in 2017, they have dominated the state of the art in several domains, including language modelling, speech processing, and even image processing. Although the main ideas of the original Transformer are essentially preserved, many improvements to the original architecture have been proposed. Transformers are based on self-attention implemented as scaled dot-product attention, a mechanism that can be described as content-based: the attention weights, and therefore the output of the self-attention layers, depend only on the input embeddings and not on their positions. Since the position of the embeddings (for instance, the order of the words in natural language) is usually very important, several ways of injecting positional information have been proposed. The original Transformer architecture added sinusoidal positional encoding vectors to the input embeddings, resulting in an absolute positional encoding. Later approaches introduced relative positional encodings, sometimes using learned positional encodings. In this talk we will review the different methods proposed to inject positional information into Transformer architectures and present one of the latest and most successful methods, Rotary Position Embedding (RoPE), which is currently used in modern LLMs. This approach multiplies the content embeddings by the positional embeddings, which can be interpreted as a rotation of the original embeddings by an angle proportional to the position. In this way, the attention weights naturally depend only on the relative position between the query and the key, not on their absolute positions. RoPE also has other interesting properties, such as a long-term decay of the attention weights with distance and a simple integration with optimizations of the Transformer such as Linear Transformers.
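
The following is a minimal illustrative sketch (not the speaker's code) of the rotation described above, assuming a NumPy implementation with hypothetical function names; it rotates each pair of embedding dimensions by an angle proportional to the token position and checks that the query-key dot product depends only on the relative offset.

    # Minimal RoPE sketch for illustration; names and NumPy usage are assumptions.
    import numpy as np

    def rope_rotate(x, positions, base=10000.0):
        """Rotate embeddings x (seq_len, d) by angles proportional to positions (seq_len,)."""
        seq_len, d = x.shape
        # One frequency per pair of dimensions, as in sinusoidal encodings.
        freqs = base ** (-np.arange(0, d, 2) / d)        # (d/2,)
        angles = positions[:, None] * freqs[None, :]     # (seq_len, d/2)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, 0::2], x[:, 1::2]                  # split dimensions into pairs
        out = np.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin               # 2-D rotation of each pair
        out[:, 1::2] = x1 * sin + x2 * cos
        return out

    # Because queries and keys are both rotated, the score q_m . k_n after
    # rotation depends only on the relative offset m - n, not on m and n alone.
    rng = np.random.default_rng(0)
    q, k = rng.normal(size=(1, 8)), rng.normal(size=(1, 8))
    for m, n in [(3, 1), (10, 8)]:                       # same offset m - n = 2
        qm = rope_rotate(q, np.array([m]))
        kn = rope_rotate(k, np.array([n]))
        print(m, n, (qm @ kn.T).item())                  # the two scores coincide

The two printed scores are numerically equal because rotating q by angle m*theta and k by angle n*theta leaves their dot product depending only on (m - n)*theta, which is the relative-position property highlighted in the talk.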