Transformers — Brief Review

Buse Köseoğlu
5 min read · Oct 24, 2023



The term ‘Transformer’ was first introduced in the 2017 paper ‘Attention Is All You Need.’ Widely used artificial intelligence models such as ChatGPT are powered by Transformers. Instead of examining words one by one, a Transformer processes all the words in a sequence together and captures the connections between them. It was also the first model to compute representations of its input and output entirely with self-attention, without using sequence-aligned RNNs or convolutions. In this article, I will briefly discuss the basic building blocks of Transformers.

Word Embedding

Transformers are a type of artificial neural network, so they take numeric inputs at the input layer. This means the words first need to be converted into numbers. There are many ways to do this, but the most commonly used method for neural networks is word embedding (word vectors). These vectors help the Transformer make sense of textual data. The goal of word embedding is to assign a numeric vector to every word and symbol in the vocabulary we want to use. The words and symbols in the vocabulary are referred to as tokens.
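To make this concrete, here is a minimal sketch of turning tokens into vectors with a learnable embedding table in PyTorch. The toy vocabulary, token ids, and embedding size are made up for illustration; they are not from the original paper.

```python
import torch
import torch.nn as nn

# A toy vocabulary: every word/symbol (token) gets an integer id.
vocab = {"<pad>": 0, "i": 1, "am": 2, "a": 3, "student": 4}

# Embedding table: one learnable vector (here 8-dimensional) per token.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

# "i am a student" -> token ids -> vectors the Transformer can consume.
token_ids = torch.tensor([[vocab["i"], vocab["am"], vocab["a"], vocab["student"]]])
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([1, 4, 8]) -> (batch, sequence length, embedding dim)
```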

Positional Encoding

Even when the same words are used, changing their order can change the meaning of a sentence. Because the Transformer looks at all the words in a sequence at once rather than one after another, it needs the positional encoding technique to keep track of word order.
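The original paper uses fixed sinusoidal positional encodings that are simply added to the word embeddings. Below is a minimal NumPy sketch of that formula; the sequence length and model dimension are chosen arbitrarily for the example.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                   # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                      # (seq_len, d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])             # even dimensions use sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])             # odd dimensions use cosine
    return angles

# One encoding vector per position; it is added to the word-embedding vectors
# so the model knows where each word sits in the sequence.
pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)
print(pe.shape)  # (4, 8)
```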

Encoder-Decoder

The Transformer model consists of two fundamental components: the Encoder and the Decoder. These two components are used together to process and generate text in text-based tasks.

  • Encoder: The Encoder is composed of a stack of N = 6 identical layers. It processes the input text (the sentence or prompt). The input is first passed through the embedding layer to create vectors. These vectors then flow through the stack: each layer takes the output of the previous layer and uses it to build higher-level representations. This layer-by-layer refinement is one of the most important features of Transformers; the goal is to extract the essence of the input and discover its features. Each layer focuses on different parts of the input using attention mechanisms, which look at the other words to determine the importance of a word.
  • Decoder: Like the Encoder, the Decoder consists of a stack of N = 6 identical layers. It uses the representation received from the Encoder to generate the output text. The Decoder starts with a “start token” and then constructs the text step by step. Each of its layers predicts the next word using the output from previous steps and the representation from the Encoder, and attention mechanisms determine which words the Decoder should focus on at each step. The Decoder finishes the output by producing an “end token.” (A minimal code sketch of this encoder-decoder stack follows this list.)
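PyTorch ships a reference implementation of this stack. Here is a minimal sketch using the paper’s default sizes (6 encoder layers, 6 decoder layers, model dimension 512, 8 attention heads); the random tensors are only stand-ins for already-embedded source and target sequences.

```python
import torch
import torch.nn as nn

# The paper's defaults: N = 6 encoder and decoder layers, d_model = 512, 8 heads.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    batch_first=True,
)

# Stand-ins for already-embedded source and target token sequences.
src = torch.rand(1, 10, 512)  # (batch, source length, d_model)
tgt = torch.rand(1, 7, 512)   # (batch, target length, d_model)

out = model(src, tgt)          # the encoder processes src, the decoder attends to it
print(out.shape)               # torch.Size([1, 7, 512])
```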

You can see the full Transformer architecture diagram in the original paper (Vaswani et al., 2017).

Let’s examine this structure with an example. Suppose we want to translate the sentence “Ben bir öğrenciyim” (“I am a student”) from Turkish to English. In this case, a seq2seq structure is used, in which the encoder and decoder work together; Google Translate uses this structure as well. We can visualize it as follows: imagine two people, one whose native language is Turkish (the encoder) and one whose native language is English (the decoder). They share a common language, Italian (the context). To translate from Turkish to English, the sentence is first translated from Turkish into Italian (the context) and then from Italian into English.
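To make the example concrete, here is a hedged sketch using the Hugging Face transformers library; the Helsinki-NLP/opus-mt-tr-en checkpoint is one publicly available Turkish-to-English encoder-decoder model, chosen here only for illustration and not mentioned in the original article.

```python
from transformers import pipeline

# An off-the-shelf encoder-decoder (seq2seq) model for Turkish -> English.
# The checkpoint name is an illustrative choice, not part of this article.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-tr-en")

result = translator("Ben bir öğrenciyim.")
print(result[0]["translation_text"])  # e.g. "I am a student."
```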

Types of Models

The encoder and decoder components in the Transformer structure can be used together or separately, depending on the task at hand. Here are different types of models:

  • Encoder-Only Models: These models are used for tasks where the output has the same length as the input, such as classification at the word or sentence level. Use cases include sentiment analysis, named entity recognition, and word classification. BERT is an example of this category.
  • Decoder-Only Models: Currently the most widely used type, these models generalize well to a variety of tasks. They are used for text generation, among other things. GPT, BLOOM, and LLaMA are examples of this category.
  • Encoder-Decoder Models: These models work well on seq2seq tasks, where the input and output lengths differ. They are commonly used for translation, summarization, and question answering. BART and T5 are examples of this category. (A short code sketch covering all three families follows this list.)
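As a rough illustration, each family maps onto a different kind of Hugging Face pipeline. The checkpoints below are common public examples chosen for this sketch, not models named in this article, and the generation settings are arbitrary.

```python
from transformers import pipeline

# Encoder-only (a BERT-style model): classify the sentiment of a sentence.
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("Transformers make NLP so much easier."))

# Decoder-only (GPT-2): continue a text prompt.
generator = pipeline("text-generation", model="gpt2")
print(generator("Transformers are", max_new_tokens=20))

# Encoder-decoder (T5): summarize a passage.
summarizer = pipeline("summarization", model="t5-small")
text = ("The Transformer was introduced in 2017 and replaced recurrent networks "
        "with self-attention, enabling parallel processing of entire sequences.")
print(summarizer(text, max_length=30, min_length=5))
```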

Attention Mechanism

The concept of attention is essential when working with Transformers: it is what allows the model to work out the meaning of a word from its context. Self-attention does this by calculating how similar each word in a sentence is to every other word (and to itself). For example, consider the sentence:

- “The pizza came out of the oven and it tasted good.”

When the word “The” is selected, the similarity between this word and every other word, as well as its similarity to itself, is calculated. This is done for every word in the sentence. These similarities then determine how the Transformer encodes each word. In the sentence above, for example, the relationship between “it” and “pizza” is stronger than the relationship between “it” and “oven,” so the similarity score with “pizza” has a greater impact on how the Transformer encodes “it.”
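Concretely, the original paper computes these similarities as scaled dot products between “query” and “key” vectors, turns them into weights with a softmax, and uses the weights to blend the “value” vectors. Below is a minimal NumPy sketch of that formula; the small random matrices are only stand-ins for real word representations.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Vaswani et al. (2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # similarity of every word to every word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                                        # weighted mix of value vectors

rng = np.random.default_rng(0)
# 5 "words", each represented by a 4-dimensional query, key, and value vector.
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))

print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 4): one new vector per word
```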

Training and using Transformers requires a significant amount of computational resources, so they can be expensive to build and run, and they need to be trained on large datasets to be effective.
Despite this, the Transformers that have shaped NLP continue to make people’s jobs easier in many areas.

References

Vaswani, Ashish, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30, 2017.

GPT-3: Building Innovative NLP Products Using LLMs

https://serokell.io/blog/transformers-in-ml
