lexical analysis, lexing or tokenisation is the process of converting a sequence of characters (such as in a computer programme or web page) into a sequence of tokens (strings with an assigned and thus identified meaning).

 

A programme that performs lexical analysis may be termed a lexer, tokeniser, or scanner, although scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyse the syntax of programming languages, web pages, and so forth.

 

Lexical Analysis can be implemented with the deterministic finite automata.

2021.08.27 - [CS_Theory] - [Regular Languages] Determinism (DFA)

 

[Regular Languages] Determinism (DFA)

 

sorapark.tistory.com

The output is a sequence of tokens, which is sent to the parser for syntax analysis.

 

  Read characters   Token  
Input
Lexical Analyser
Syntax Analyser
  Push back extra characters   Ask for token  

 

Lexeme

  • lexeme is a sequence of characters in the source programme that matches the pattern for a token and is identified by the lexical analyser as an instance of that token.
  • Some authors term this a "token", using "token" interchangeably to represent the string being tokenised, and the token data structure resulting from putting this string through the tokenisation process.

Token

  • A lexical token or simply token is a string with an assigned and thus identified meaning.
  • It is structured as a pair consisting of a token name and an optional token value.

The token name is a category of lexical unit. Common token names are:

  1. identifier: names the programmer chooses
    ex) x, colour and UP
  2. keyword: names already in the programming language
    ex) ifwhile and return
  3. separator (also known as punctuators): punctuation characters and paired-delimiters
    ex) }, ( and ;
  4. operator: symbols that operate on arguments and produce results
    ex) +, < and =
  5. literal: numeric, logical, textual, reference literals
    ex) true6.02e23 and "music"
  6. comment: line, block (depends on the compiler if compiler implements comments as tokens otherwise it will be stripped)
    ex) /* */ and //

For example, 

x = a + b * 2;

All the valid tokens are:

  • x
  • =
  • a
  • +
  • b
  • *
  • 2
  • ;

 

The lexical analysis of the expression yields the following sequence of tokens:

[(identifier, x), (operator, =), (identifier, a), (operator, +), (identifier, b), (operator, *), (literal, 2), (separator, ;)]

 

+ Recent posts