Text is often referred to as unstructured data.

  • Text has plenty of structure, but it is linguistic structure, intended for human consumption rather than for computers.

 

Text may contain synonyms (multiple words with the same meaning) and homographs (one spelling shared among multiple words with different meanings).

 

People write ungrammatically, misspell words, run words together, abbreviate unpredictably, and punctuate randomly.

 

Because text is intended for communication between people, context is important.

 

The general strategy in text mining is to use the simplest technique that works.

  • A document is composed of individual tokens or terms.
  • A collection of documents is called a corpus.

 

Language is ambiguous. To determine structure, we must resolve ambiguity.

  • Processing text data (sketched in code after this list):
    • Lexical analysis (tokenisation)
      : Tokenisation is the task of chopping a character sequence up into pieces, called tokens.
    • Stop word removal
      : A stopword is a very common word that carries little content on its own. Words such as the, and, of and on are considered stopwords, so they are typically removed.
    • Stemming
      : Suffixes are removed so that verbs like announces, announced and announcing are all reduced to the term announc. Stemming also transforms noun plurals to their singular forms, so directors becomes director.
    • Lemmatisation
      : A lemma is the canonical form, dictionary form, or citation form of a set of word forms. For example, break, breaks, broke, broken and breaking are forms of the same lexeme, with break as the lemma by which they are indexed. Lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning.
    • Morphology (prefixes, suffixes, etc.)
  • The higher levels of ambiguity:
    • Syntax (part of speech tagging)
      • Ambiguity problem
    • Parsing (grammar)
    • Sentence boundary detection
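These lexical steps can be sketched with NLTK (an assumed library choice; the notes do not prescribe one, and the download names below reflect classic NLTK releases):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the tokeniser model, stopword list and WordNet data.
for pkg in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(pkg, quiet=True)

text = "The director announced that the directors were announcing plans."

# Lexical analysis (tokenisation): chop the character sequence into tokens.
tokens = nltk.word_tokenize(text.lower())

# Stop word removal: drop very common, low-content words (the, that, were, ...).
stop = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t not in stop]

# Stemming: strip suffixes, e.g. announced/announcing -> announc.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in content])
# e.g. ['director', 'announc', 'director', 'announc', 'plan']

# Lemmatisation: map word forms to the lemma; forms WordNet does not know
# as verbs pass through unchanged.
lemmatiser = WordNetLemmatizer()
print([lemmatiser.lemmatize(t, pos="v") for t in content])

# Part-of-speech tagging addresses one of the higher levels of ambiguity.
print(nltk.pos_tag(content))

Note that stemming can produce non-words (announc), while lemmatisation always returns a dictionary form.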

 

Lexical analysis, lexing or tokenisation is the process of converting a sequence of characters (such as in a computer programme or web page) into a sequence of tokens (strings with an assigned and thus identified meaning).

 

A programme that performs lexical analysis may be termed a lexer, tokeniser, or scanner, although scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyse the syntax of programming languages, web pages, and so forth.

 

Lexical analysis can be implemented with deterministic finite automata (DFAs).
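A minimal sketch of the idea in Python (the states, transition table and token classes here are illustrative; lexer generators such as lex/flex derive equivalent tables from regular expressions):

# Toy DFA recognising two token classes: identifiers and integer literals.
TRANSITIONS = {
    ("start", "letter"): "in_identifier",
    ("start", "digit"): "in_number",
    ("in_identifier", "letter"): "in_identifier",
    ("in_identifier", "digit"): "in_identifier",
    ("in_number", "digit"): "in_number",
}
ACCEPTING = {"in_identifier": "identifier", "in_number": "literal"}

def char_class(ch):
    if ch.isalpha() or ch == "_":
        return "letter"
    if ch.isdigit():
        return "digit"
    return None  # no transition for any other character

def scan(lexeme):
    """Run the DFA; return the token name, or None if the input is rejected."""
    state = "start"
    for ch in lexeme:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return None  # dead state: no valid transition
    return ACCEPTING.get(state)

print(scan("colour"))  # identifier
print(scan("42"))      # literal
print(scan("4x"))      # None: an identifier may not start with a digit

Determinism is what keeps this cheap: each (state, input class) pair has at most one next state, so scanning is a single pass with no backtracking.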

Related post: 2021.08.27 - [CS_Theory] - [Regular Languages] Determinism (DFA), sorapark.tistory.com

The output is a sequence of tokens, which is sent to the parser for syntax analysis.

 

         read characters                       token
Input  ─────────────────►  Lexical   ─────────────────►  Syntax
       ◄─────────────────  Analyser  ◄─────────────────  Analyser
        push back extra                   ask for token
        characters
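Python's standard-library tokenize module shows this handoff on real input: the lexer emits (token type, string) pairs, which the parser then consumes.

import io
import tokenize

# Python's own lexer: each token carries a type name and the matched string.
source = "x = a + b * 2"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# NAME 'x', OP '=', NAME 'a', OP '+', NAME 'b', OP '*', NUMBER '2', ...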

 

Lexeme

  • A lexeme is a sequence of characters in the source programme that matches the pattern for a token and is identified by the lexical analyser as an instance of that token.
  • Some authors call this a "token", using the word interchangeably for both the string being tokenised and the token data structure that results from putting this string through the tokenisation process.

Token

  • A lexical token or simply token is a string with an assigned and thus identified meaning.
  • It is structured as a pair consisting of a token name and an optional token value.

The token name is a category of lexical unit. Common token names are:

  1. identifier: names the programmer chooses
    ex) x, colour and UP
  2. keyword: names already in the programming language
    ex) if, while and return
  3. separator (also known as punctuators): punctuation characters and paired-delimiters
    ex) }, ( and ;
  4. operator: symbols that operate on arguments and produce results
    ex) +, < and =
  5. literal: numeric, logical, textual, reference literals
    ex) true, 6.02e23 and "music"
  6. comment: line or block comments (whether comments become tokens depends on the compiler; otherwise they are stripped out)
    ex) /* */ and //

For example, 

x = a + b * 2;

All the valid tokens are:

  • x
  • =
  • a
  • +
  • b
  • *
  • 2
  • ;

 

The lexical analysis of the expression yields the following sequence of tokens:

[(identifier, x), (operator, =), (identifier, a), (operator, +), (identifier, b), (operator, *), (literal, 2), (separator, ;)]
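A compact regex-driven tokeniser in the style of the example in Python's re documentation reproduces this sequence (the pattern set and names such as TOKEN_SPEC are illustrative, covering only the categories this statement needs):

import re

# Token categories mirroring the list above (no keywords or comments,
# since the input contains none).
TOKEN_SPEC = [
    ("literal",    r"\d+(?:\.\d+)?"),   # numeric literals such as 2 or 6.02
    ("identifier", r"[A-Za-z_]\w*"),    # names the programmer chooses
    ("operator",   r"[=+\-*/<>]"),      # symbols that operate on arguments
    ("separator",  r"[;,(){}]"),        # punctuation and paired delimiters
    ("skip",       r"\s+"),             # whitespace, discarded
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenise(source):
    for match in TOKEN_RE.finditer(source):
        if match.lastgroup != "skip":
            yield (match.lastgroup, match.group())

print(list(tokenise("x = a + b * 2;")))
# [('identifier', 'x'), ('operator', '='), ('identifier', 'a'),
#  ('operator', '+'), ('identifier', 'b'), ('operator', '*'),
#  ('literal', '2'), ('separator', ';')]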

 
