Primary
❪֎₆❫ Tokens ○|Definition|1st|20251021004113-00-⌔
Lexical analysis - Wikipedia#Token
A lexical token is a string with an assigned and thus identified meaning, in contrast to the probabilistic token used in large language models. A lexical token consists of a token name and an optional token value. The token name is a category of a rule-based lexical unit.[1]
Consider this expression in the C programming language:
- ✤
x = a + b * 2;
The lexical analysis of this expression yields the following sequence of tokens:
- ✤
[(identifier, 'x'), (operator, '='), (identifier, 'a'), (operator, '+'), (identifier, 'b'), (operator, '*'), (literal, '2'), (separator, ';')]
A token name is what might be termed a part of speech in linguistics.
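The token sequence above can be reproduced with a small regular-expression lexer. This is a minimal sketch, not a real C lexer: the names `TOKEN_SPEC`, `MASTER`, and `tokenize` are hypothetical, and the patterns cover only what this one expression needs.

```python
import re

# Illustrative token categories; the patterns are assumptions that cover
# only the example expression, not the C language.
TOKEN_SPEC = [
    ("identifier", r"[A-Za-z_]\w*"),
    ("literal",    r"\d+"),
    ("operator",   r"[=+\-*/]"),
    ("separator",  r";"),
    ("skip",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return (token name, token value) pairs, dropping whitespace.

    Characters that match no pattern are silently skipped by finditer;
    this sketch does no error reporting.
    """
    tokens = []
    for match in MASTER.finditer(source):
        if match.lastgroup != "skip":
            tokens.append((match.lastgroup, match.group()))
    return tokens

tokens = tokenize("x = a + b * 2;")
print(tokens)
```

Each pair holds the token name (the category) first and the token value (the matched lexeme) second, mirroring the notation used in the listing.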
Lexical tokenization is the conversion of a raw text into (semantically or syntactically) meaningful lexical tokens, belonging to categories defined by a “lexer” program, such as identifiers, operators, grouping symbols, and data types. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.
For example, in the text string:
- ✤
The quick brown fox jumps over the lazy dog
the string is not implicitly segmented on spaces, as a natural language speaker would do. The raw input, the 43 characters, must be explicitly split into the 9 tokens with a given space delimiter (i.e., matching the string " " or regular expression /\s{1}/).
When a token class represents more than one possible lexeme, the lexer often saves enough information to reproduce the original lexeme, so that it can be used in semantic analysis. The parser typically retrieves this information from the lexer and stores it in the abstract syntax tree. This is necessary in order to avoid information loss in the case where numbers may also be valid identifiers.
Tokens are identified based on the specific rules of the lexer. Some methods used to identify tokens include regular expressions, specific sequences of characters termed a flag, specific separating characters called delimiters, and explicit definition by a dictionary. Special characters, including punctuation characters, are commonly used by lexers to identify tokens because of their natural use in written and programming languages. A lexical analyzer generally does nothing with combinations of tokens, a task left for a parser. For example, a typical lexical analyzer recognizes parentheses as tokens but does nothing to ensure that each "(" is matched with a ")".
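The division of labour around parentheses can be illustrated as follows. This is a sketch under assumptions: `tokenize_parens` and `parens_balanced` are hypothetical names, and the balance check stands in for what a real parser would do as part of its grammar.

```python
import re

def tokenize_parens(source):
    """Lexer step: emit one token per parenthesis or word, with no balance check."""
    tokens = []
    for lexeme in re.findall(r"[()]|\w+", source):
        if lexeme == "(":
            tokens.append(("lparen", lexeme))
        elif lexeme == ")":
            tokens.append(("rparen", lexeme))
        else:
            tokens.append(("word", lexeme))
    return tokens

def parens_balanced(tokens):
    """Parser-level step: every '(' must be closed by a later ')'."""
    depth = 0
    for name, _ in tokens:
        if name == "lparen":
            depth += 1
        elif name == "rparen":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

# The lexer accepts unbalanced input without complaint.
unbalanced = tokenize_parens("f ( a ( b )")
print(unbalanced)
print(parens_balanced(unbalanced))
```

The lexer produces a perfectly valid token list for the unbalanced input; only the later check, which looks at combinations of tokens, notices the problem.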
When a lexer feeds tokens to the parser, the representation used is typically an enumerated type, which is a list of number representations. For example, “Identifier” can be represented with 0, “Assignment operator” with 1, “Addition operator” with 2, etc.
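The numbering described above might look like this in practice. The class name `TokenKind` and the token stream are hypothetical; only the three categories and their numbers come from the text.

```python
from enum import IntEnum

class TokenKind(IntEnum):
    """Token names as an enumerated type; numbers follow the example in the text."""
    IDENTIFIER = 0
    ASSIGNMENT_OPERATOR = 1
    ADDITION_OPERATOR = 2

# Hand-written token stream for the input "x = a + b" (illustrative only).
stream = [
    (TokenKind.IDENTIFIER, "x"),
    (TokenKind.ASSIGNMENT_OPERATOR, "="),
    (TokenKind.IDENTIFIER, "a"),
    (TokenKind.ADDITION_OPERATOR, "+"),
    (TokenKind.IDENTIFIER, "b"),
]

# What the parser sees if it looks only at the numeric representation.
kinds = [int(kind) for kind, _ in stream]
print(kinds)
```

The lexeme is kept alongside each kind, so the three identifiers remain distinguishable even though they share the number 0.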
Tokens are often defined by regular expressions, which are understood by a lexical analyzer generator such as lex, or handcoded equivalent finite-state automata. The lexical analyzer (generated automatically by a tool like lex or hand-crafted) reads in a stream of characters, identifies the lexemes in the stream, and categorizes them into tokens. This is termed tokenizing. If the lexer finds an invalid token, it will report an error.
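A hand-written scanner that reports invalid input could be sketched like this. The names `TOKEN_RE`, `LexError`, and `scan` are hypothetical, and the token classes are a deliberately tiny assumption; a generator such as lex would derive an equivalent automaton from the patterns.

```python
import re

# Every alternative matches at least one character, so the scan always advances.
TOKEN_RE = re.compile(
    r"(?P<number>\d+)|(?P<name>[A-Za-z_]\w*)|(?P<op>[=+*;-])|(?P<ws>\s+)"
)

class LexError(Exception):
    """Raised when no token pattern matches at the current position."""

def scan(source):
    """Read characters left to right, identify lexemes, and categorize them."""
    pos = 0
    tokens = []
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise LexError(f"invalid character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "ws":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(scan("x = 1"))
```

Calling `scan("x = 1 @ 2")` raises `LexError`, because `@` belongs to none of the token classes defined here.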
Following tokenizing is parsing. From there, the interpreted data may be loaded into data structures for general use, interpretation, or compiling.
Printed 2026-06-28.
(echo:: @ ᯤ)
Footnotes
[1] Page 111 of "Compilers: Principles, Techniques, and Tools", 2nd ed. (WorldCat), by Aho, Lam, Sethi, and Ullman, as quoted in https://stackoverflow.com/questions/14954721/what-is-the-difference-between-token-and-lexeme
Secondary
• • •