Primary

Transformer ○ᴹᴸ｜Definition｜1st｜20260628123146-00-⌔
Transformer (deep learning) - Wikipedia

Transformer (deep learning)

In deep learning, the transformer is a family of artificial neural network architectures based on the multi-head attention mechanism. Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table.¹ At each layer, each token is then contextualized against the other (unmasked) tokens within the context window via a parallel multi-head attention mechanism, which amplifies the signal from key tokens and diminishes that of less important ones. Because self-attention alone has no built-in notion of token order (it is permutation-equivariant: reordering the input tokens merely reorders the outputs), transformers inject positional information, typically through positional encodings or learned positional embeddings, so that token order can affect the output.²
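The pipeline described above (token lookup, positional information, attention over the context) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the reference implementation: the vocabulary, the embedding values, and the names `VOCAB`, `EMBEDDINGS`, `embed`, and `self_attention` are hypothetical, the attention is single-head rather than multi-head, and the learned query, key, and value projections are omitted so that Q = K = V. The positional encoding follows the sinusoidal scheme of the 2017 paper.

```python
import math

# Hypothetical toy vocabulary and embedding table (values are arbitrary).
VOCAB = {"the": 0, "cat": 1, "sat": 2}
EMBEDDINGS = [
    [0.1, 0.3, -0.2, 0.5],
    [0.7, -0.1, 0.4, 0.0],
    [-0.3, 0.8, 0.2, 0.1],
]
D_MODEL = 4


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def positional_encoding(pos, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    return [
        math.sin(pos / 10000 ** (i / d_model)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / d_model))
        for i in range(d_model)
    ]


def embed(words, use_positions=True):
    """Map words to token ids, look up vectors, optionally add positions."""
    rows = []
    for pos, word in enumerate(words):
        vec = EMBEDDINGS[VOCAB[word]]
        if use_positions:
            pe = positional_encoding(pos, D_MODEL)
            vec = [v + p for v, p in zip(vec, pe)]
        rows.append(list(vec))
    return rows


def self_attention(x):
    """Single-head scaled dot-product attention with Q = K = V = x.

    A real transformer layer applies learned projections and several
    heads in parallel; both are left out to keep the sketch short.
    """
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)]
        )
    return outputs, weights


contextual, weights = self_attention(embed(["the", "cat", "sat"]))
```

Each row of `weights` is a probability distribution over the tokens in the context, and each row of `contextual` is the corresponding weighted mix of token vectors. Calling `embed(..., use_positions=False)` on a reordered sentence yields the same output vectors in a different order, which is why positional information has to be added explicitly.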

Transformers have no recurrent units, so training can be parallelized across sequence positions and requires less time than with earlier recurrent neural networks (RNNs) such as long short-term memory (LSTM).³ Later variations have been widely adopted for training large language models (LLMs) on large language datasets.⁴ Modern transformer designs are commonly grouped into encoder-only, decoder-only, and encoder-decoder variants, depending on whether they are optimized for representation learning, autoregressive generation, or conditional sequence-to-sequence tasks.⁵
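One practical difference between these variants is which positions a token may attend to. The sketch below is an illustration under assumptions, not any particular model's code: the raw scores in `SCORES` are made up, and `masked_softmax` and `attention_weights` are hypothetical helper names. With `causal=False` every token sees the whole sequence, as in encoder-style attention; with `causal=True` token i only sees positions up to i, as in decoder-style autoregressive attention. Encoder-decoder models combine both and add cross-attention from the decoder to the encoder output, which is not shown here.

```python
import math


def masked_softmax(scores, allowed):
    """Softmax that assigns zero weight to positions where allowed is False."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    m = max(kept)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, causal):
    """Turn an n x n matrix of raw scores into attention weights.

    causal=False: bidirectional attention (encoder-style).
    causal=True:  position i attends only to positions j <= i (decoder-style).
    """
    n = len(scores)
    rows = []
    for i in range(n):
        allowed = [(j <= i) if causal else True for j in range(n)]
        rows.append(masked_softmax(scores[i], allowed))
    return rows


# Made-up raw attention scores for a three-token sequence.
SCORES = [
    [0.2, 1.0, -0.5],
    [0.7, 0.1, 0.3],
    [-0.4, 0.9, 0.6],
]

encoder_style = attention_weights(SCORES, causal=False)
decoder_style = attention_weights(SCORES, causal=True)
```

In `decoder_style` the entries above the diagonal are zero, so the first token can only attend to itself, while every entry of `encoder_style` is positive.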

The original version of the transformer architecture was proposed in the 2017 paper “Attention Is All You Need” by researchers at Google.¹ The attention-based predecessors of transformers were developed as improvements over earlier architectures for machine translation,⁶⁷ but transformers have since found many other applications. They are used in large-scale natural language processing, computer vision (vision transformers), reinforcement learning,⁸⁹ audio,¹⁰ multimodal learning, robotics,¹¹ and playing chess.¹² They have also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs)¹³ and BERT¹⁴ (bidirectional encoder representations from transformers).

Printed 2026-06-28.

Footnotes

1. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia (2017). “Attention Is All You Need” (PDF). Advances in Neural Information Processing Systems. 30. Curran Associates, Inc. Archived (PDF) from the original on 2024-02-21. Retrieved 2023-10-31.

2. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia (2017). “Attention Is All You Need” (PDF). Advances in Neural Information Processing Systems. Archived (PDF) from the original on 2024-02-21. Retrieved 2026-05-05.

3. Hochreiter, Sepp; Schmidhuber, Jürgen (November 1997). “Long Short-Term Memory”. Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276.

4. “Better Language Models and Their Implications”. OpenAI. 2019-02-14. Archived from the original on 2020-12-19. Retrieved 2019-08-25.

5. Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2019-10-23). “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. arXiv:1910.10683 [cs.LG].

6. Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (September 1, 2014). “Neural Machine Translation by Jointly Learning to Align and Translate”. arXiv:1409.0473 [cs.CL].

7. Luong, Minh-Thang; Pham, Hieu; Manning, Christopher D. (August 17, 2015). “Effective Approaches to Attention-based Neural Machine Translation”. arXiv:1508.04025 [cs.CL].

8. Chen, Lili; Lu, Kevin; Rajeswaran, Aravind; Lee, Kimin; Grover, Aditya; Laskin, Michael; Abbeel, Pieter; Srinivas, Aravind; Mordatch, Igor (2021-06-24). “Decision Transformer: Reinforcement Learning via Sequence Modeling”. arXiv:2106.01345.

9. Parisotto, Emilio; Song, Francis; Rae, Jack; Pascanu, Razvan; Gulcehre, Caglar; Jayakumar, Siddhant; Jaderberg, Max; Kaufman, Raphaël Lopez; Clark, Aidan; Noury, Seb; Botvinick, Matthew; Heess, Nicolas; Hadsell, Raia (2020-11-21). “Stabilizing Transformers for Reinforcement Learning”. Proceedings of the 37th International Conference on Machine Learning. PMLR: 7487–7498. Archived from the original on 2024-08-09. Retrieved 2024-08-09.

10. Radford, Alec; Kim, Jong Wook; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022). “Robust Speech Recognition via Large-Scale Weak Supervision”. arXiv:2212.04356 [eess.AS].

11. Monastirsky, Maxim; Azulay, Osher; Sintov, Avishai (February 2023). “Learning to Throw With a Handful of Samples Using Decision Transformers”. IEEE Robotics and Automation Letters. 8 (2): 576–583. Bibcode:2023IRAL…8..576M. doi:10.1109/LRA.2022.3229266.

12. Ruoss, Anian; Delétang, Grégoire; Medapati, Sourabh; Grau-Moya, Jordi; Wenliang, Li; Catt, Elliot; Reid, John; Genewein, Tim (2024-02-07). “Grandmaster-Level Chess Without Search”. arXiv:2402.04494v1 [cs.LG].

13. Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; Cistac, Pierric; Rault, Tim; Louf, Remi; Funtowicz, Morgan; Davison, Joe; Shleifer, Sam; von Platen, Patrick; Ma, Clara; Jernite, Yacine; Plu, Julien; Xu, Canwen; Le Scao, Teven; Gugger, Sylvain; Drame, Mariama; Lhoest, Quentin; Rush, Alexander (2020). “Transformers: State-of-the-Art Natural Language Processing”. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45. doi:10.18653/v1/2020.emnlp-demos.6.

14. “Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing”. Google AI Blog. 2 November 2018. Archived from the original on 2021-01-13. Retrieved 2019-08-25.

Secondary