🔵 🔵 🔵


Primary

၊၊||၊|။

Transformer ○ᴹᴸ|Definition|1st|20260628123146-00-⌔

Transformer (deep learning) - Wikipedia

In deep learning, the transformer is a family of artificial neural network architectures based on the multi-head attention mechanism. Text is first converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table.[1] At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, which amplifies the signal from key tokens and diminishes that from less important ones. Because self-attention alone is insensitive to token order (strictly, it is permutation-equivariant: reordering the input tokens only reorders the outputs), transformers inject positional information, typically through positional encodings or learned positional embeddings, so that token order can affect the output.[2]
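The pipeline described above (token lookup, positional information, attention) can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the reference implementation: the vocabulary, the embedding values, and the helper names (`embed`, `self_attention`, `sinusoidal_position`) are hypothetical, there is a single attention head, and the query, key, and value projections are taken to be the identity. The sinusoidal encoding follows the formula from the 2017 paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head scaled dot-product self-attention.

    Assumption for brevity: Q = K = V = input (identity projections).
    """
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)])
    return outputs

def sinusoidal_position(pos, d_model):
    """Sinusoidal positional encoding: sin on even dimensions, cos on odd ones."""
    enc = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

# Hypothetical toy vocabulary and embedding table (values chosen arbitrarily).
vocab = {"the": 0, "cat": 1, "sat": 2}
embedding_table = [
    [0.1, 0.3, -0.2, 0.5],
    [0.7, -0.1, 0.4, 0.0],
    [-0.3, 0.2, 0.6, 0.1],
]

def embed(words, use_positions):
    """Map words to token ids, look up their vectors, optionally add positions."""
    token_ids = [vocab[w] for w in words]
    vectors = [list(embedding_table[t]) for t in token_ids]
    if use_positions:
        vectors = [
            [x + p for x, p in zip(vec, sinusoidal_position(pos, len(vec)))]
            for pos, vec in enumerate(vectors)
        ]
    return vectors

forward = ["the", "cat", "sat"]
backward = ["sat", "cat", "the"]

# Without positional information, "the" gets the same output in either order.
plain_fwd = self_attention(embed(forward, use_positions=False))
plain_bwd = self_attention(embed(backward, use_positions=False))

# With positional encodings added, the output for "the" depends on its position.
pos_fwd = self_attention(embed(forward, use_positions=True))
pos_bwd = self_attention(embed(backward, use_positions=True))
```

Here `plain_fwd[0]` and `plain_bwd[2]` (both the output for "the") agree up to floating-point rounding, while `pos_fwd[0]` and `pos_bwd[2]` differ, which is the reason positional information is injected.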

Because transformers have no recurrent units, they can process all tokens of a sequence in parallel during training and therefore require less training time than earlier recurrent neural network (RNN) architectures such as long short-term memory (LSTM).[3] Later variations have been widely adopted for training large language models (LLMs) on large language datasets.[4] Modern transformer designs are commonly grouped into encoder-only, decoder-only, and encoder-decoder variants, depending on whether they are optimized for representation learning, autoregressive generation, or conditional sequence-to-sequence tasks.[5]
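One concrete way to see the difference between the encoder-only and decoder-only variants is the attention mask: encoder-style attention lets every token attend to every other token, while decoder-style (autoregressive) attention uses a causal mask so that a token only attends to itself and earlier positions. The sketch below is a simplified illustration; the function names and the raw score matrix are hypothetical, and real models compute the scores from learned projections. Encoder-decoder models combine both patterns and add cross-attention from the decoder to the encoder output, which is omitted here.

```python
import math

def masked_attention_weights(scores, mask):
    """Row-wise softmax over scores; positions where mask is False get weight 0."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        kept = [s for s, m in zip(row_scores, row_mask) if m]
        peak = max(kept)  # subtract the max of the visible scores for stability
        exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

# Hypothetical raw attention scores for a 3-token sequence.
scores = [
    [0.5, 1.0, -0.2],
    [0.1, 0.3, 0.9],
    [1.2, -0.4, 0.0],
]

enc_weights = masked_attention_weights(scores, bidirectional_mask(3))
dec_weights = masked_attention_weights(scores, causal_mask(3))
```

With the causal mask, the first token can only attend to itself, so `dec_weights[0]` puts all of its weight on position 0, whereas every entry of `enc_weights` is strictly positive.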

The original version of the transformer architecture was proposed in the 2017 paper “Attention Is All You Need” by researchers at Google.[1] The attention-based predecessors of transformers were developed as an improvement over previous architectures for machine translation,[6][7] but transformers have since found many other applications. They are used in large-scale natural language processing, computer vision (vision transformers), reinforcement learning,[8][9] audio,[10] multimodal learning, robotics,[11] and playing chess.[12] The architecture has also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs)[13] and BERT[14] (bidirectional encoder representations from transformers).

Printed 2026-06-28.

Footnotes

  1. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia (2017). “Attention Is All You Need” (PDF). Advances in Neural Information Processing Systems. 30. Curran Associates, Inc. Archived (PDF) from the original on 2024-02-21. Retrieved 2023-10-31.

  2. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia (2017). “Attention Is All You Need” (PDF). Advances in Neural Information Processing Systems. Archived (PDF) from the original on 2024-02-21. Retrieved 2026-05-05.

  3. Hochreiter, Sepp; Schmidhuber, Jürgen (November 1997). “Long Short-Term Memory”. Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276.

  4. “Better Language Models and Their Implications”. OpenAI. 2019-02-14. Archived from the original on 2020-12-19. Retrieved 2019-08-25.

  5. Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2019-10-23). “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. arXiv:1910.10683 [cs.LG].

  6. Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (September 1, 2014). “Neural Machine Translation by Jointly Learning to Align and Translate”. arXiv:1409.0473 [cs.CL].

  7. Luong, Minh-Thang; Pham, Hieu; Manning, Christopher D. (August 17, 2015). “Effective Approaches to Attention-based Neural Machine Translation”. arXiv:1508.04025 [cs.CL].

  8. Chen, Lili; Lu, Kevin; Rajeswaran, Aravind; Lee, Kimin; Grover, Aditya; Laskin, Michael; Abbeel, Pieter; Srinivas, Aravind; Mordatch, Igor (2021-06-24). “Decision Transformer: Reinforcement Learning via Sequence Modeling”. arXiv:2106.01345.

  9. Parisotto, Emilio; Song, Francis; Rae, Jack; Pascanu, Razvan; Gulcehre, Caglar; Jayakumar, Siddhant; Jaderberg, Max; Kaufman, Raphaël Lopez; Clark, Aidan; Noury, Seb; Botvinick, Matthew; Heess, Nicolas; Hadsell, Raia (2020-11-21). “Stabilizing Transformers for Reinforcement Learning”. Proceedings of the 37th International Conference on Machine Learning. PMLR: 7487–7498. Archived from the original on 2024-08-09. Retrieved 2024-08-09.

  10. Radford, Alec; Jong Wook Kim; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022). “Robust Speech Recognition via Large-Scale Weak Supervision”. arXiv:2212.04356 [eess.AS].

  11. Monastirsky, Maxim; Azulay, Osher; Sintov, Avishai (February 2023). “Learning to Throw With a Handful of Samples Using Decision Transformers”. IEEE Robotics and Automation Letters. 8 (2): 576–583. Bibcode:2023IRAL…8..576M. doi:10.1109/LRA.2022.3229266.

  12. Ruoss, Anian; Delétang, Grégoire; Medapati, Sourabh; Grau-Moya, Jordi; Wenliang, Li; Catt, Elliot; Reid, John; Genewein, Tim (2024-02-07). “Grandmaster-Level Chess Without Search”. arXiv:2402.04494v1 [cs.LG].

  13. Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; Cistac, Pierric; Rault, Tim; Louf, Remi; Funtowicz, Morgan; Davison, Joe; Shleifer, Sam; von Platen, Patrick; Ma, Clara; Jernite, Yacine; Plu, Julien; Xu, Canwen; Le Scao, Teven; Gugger, Sylvain; Drame, Mariama; Lhoest, Quentin; Rush, Alexander (2020). “Transformers: State-of-the-Art Natural Language Processing”. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45. doi:10.18653/v1/2020.emnlp-demos.6.

  14. “Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing”. Google AI Blog. 2 November 2018. Archived from the original on 2021-01-13. Retrieved 2019-08-25.

Secondary

• • •