Lexical analysis, lexing, or tokenisation is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning).
A program that performs lexical analysis may be termed a lexer, tokeniser, or scanner, although scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyse the syntax of programming languages, web pages, and so forth.
Lexical analysis can be implemented with deterministic finite automata.
2021.08.27 - [CS_Theory] - [Regular Languages] Determinism (DFA)
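Since token patterns are regular, each token class can be recognised by a DFA. Below is a minimal sketch of such an automaton for identifiers of the form letter (letter | digit)*; the state names and transition table are illustrative assumptions, not something defined in this note.

```python
# Sketch: recognising identifiers with a DFA (state names are assumptions).
IDENT_DFA = {                        # transition table: (state, input class) -> next state
    ("start", "letter"): "ident",
    ("ident", "letter"): "ident",
    ("ident", "digit"): "ident",
}
ACCEPTING = {"ident"}

def char_class(ch: str) -> str:
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

def is_identifier(text: str) -> bool:
    """Run the DFA over the input; accept only if it ends in an accepting state."""
    state = "start"
    for ch in text:
        state = IDENT_DFA.get((state, char_class(ch)))
        if state is None:            # no transition defined: reject
            return False
    return state in ACCEPTING

print(is_identifier("colour"))       # True
print(is_identifier("UP"))           # True
print(is_identifier("2x"))           # False: an identifier cannot start with a digit
```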
The output is a sequence of tokens, which is sent to the parser for syntax analysis.
[Diagram] The syntax analyser asks the lexical analyser for a token; the lexical analyser reads characters from the input, pushes back any extra characters it read past the end of a lexeme, and returns the next token.
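A minimal sketch of this interaction, assuming hypothetical CharStream and next_token names: the syntax analyser calls next_token whenever it needs one, and the lexical analyser pushes back the extra character it read past the end of a lexeme.

```python
# Sketch of the lexer/parser interface; names and token rules are illustrative assumptions.
class CharStream:
    def __init__(self, text: str):
        self.text = text
        self.pos = 0

    def read(self) -> str:
        ch = self.text[self.pos] if self.pos < len(self.text) else ""
        self.pos += 1
        return ch

    def push_back(self) -> None:
        # Return the extra character so the next token starts at the right place.
        self.pos -= 1

def next_token(stream):
    """Called by the syntax analyser each time it asks for a token; returns (name, lexeme) or None."""
    ch = stream.read()
    while ch.isspace():
        ch = stream.read()
    if not ch:
        return None
    if ch.isdigit():                 # accumulate a numeric literal
        lexeme = ch
        ch = stream.read()
        while ch.isdigit():
            lexeme += ch
            ch = stream.read()
        if ch:
            stream.push_back()       # read one character too far
        return ("literal", lexeme)
    return ("operator", ch)          # everything else: a one-character token

stream = CharStream("12 + 34")
print([next_token(stream) for _ in range(3)])
# [('literal', '12'), ('operator', '+'), ('literal', '34')]
```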
Lexeme
- A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyser as an instance of that token.
- Some authors term this a "token", using "token" interchangeably for both the string being tokenised and the token data structure that results from putting this string through the tokenisation process.
Token
- A lexical token or simply token is a string with an assigned and thus identified meaning.
- It is structured as a pair consisting of a token name and an optional token value.
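A minimal sketch of such a pair in Python; the Token class and its fields are assumptions for illustration, not a prescribed representation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    name: str                      # the category of lexical unit, e.g. "identifier"
    value: Optional[str] = None    # the token value, typically the matched lexeme

print(Token("identifier", "colour"))   # Token(name='identifier', value='colour')
print(Token("keyword", "if"))          # Token(name='keyword', value='if')
```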
The token name is a category of lexical unit. Common token names are:
- identifier: names the programmer chooses
  ex) x, colour and UP
- keyword: names already in the programming language
  ex) if, while and return
- separator (also known as punctuator): punctuation characters and paired delimiters
  ex) }, ( and ;
- operator: symbols that operate on arguments and produce results
  ex) +, < and =
- literal: numeric, logical, textual and reference literals
  ex) true, 6.02e23 and "music"
- comment: line or block comments (whether comments are emitted as tokens depends on the compiler; otherwise they are stripped)
  ex) /* */ and //
For example,
x = a + b * 2;
The expression contains the following lexemes:
- x
- =
- a
- +
- b
- *
- 2
- ;
The lexical analysis of the expression yields the following sequence of tokens:
[(identifier, x), (operator, =), (identifier, a), (operator, +), (identifier, b), (operator, *), (literal, 2), (separator, ;)]
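The same sequence can be produced by a small regular-expression-based tokeniser; the pattern names and regular expressions below are illustrative assumptions rather than a definitive grammar.

```python
# Sketch: a regex-driven tokeniser for the example expression (patterns are assumptions).
import re

TOKEN_SPEC = [
    ("literal",    r"\d+(?:\.\d+)?"),      # numeric literals such as 2
    ("identifier", r"[A-Za-z_]\w*"),       # names the programmer chooses
    ("operator",   r"[=+\-*/<>]"),         # symbols that operate on arguments
    ("separator",  r"[;,(){}]"),           # punctuation characters
    ("skip",       r"\s+"),                # whitespace is discarded, not tokenised
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenise(source: str):
    for match in MASTER.finditer(source):
        if match.lastgroup != "skip":
            yield (match.lastgroup, match.group())

print(list(tokenise("x = a + b * 2;")))
# [('identifier', 'x'), ('operator', '='), ('identifier', 'a'), ('operator', '+'),
#  ('identifier', 'b'), ('operator', '*'), ('literal', '2'), ('separator', ';')]
```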