A regular expression (often shortened to regex) is a formal language for defining text strings (character sequences).

  • It is used for pattern matching (e.g. searching and replacing in text)

Formally, a regular expression is an algebraic notation for characterising a set of strings.

  • Regular expressions are particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search through.

The simplest kind of regular expression is a sequence of simple characters; putting characters in sequence is called concatenation.

 

Regular expressions are case sensitive; lower case /s/ is distinct from upper case /S/.

  • /s/ matches a lower case s but not an upper case S.
  • We can solve this problem with the use of the square braces [ and ].
    → The string of characters inside the braces specifies a disjunction of characters to match.
    → e.g. the pattern /[sS]/ matches patterns containing either s or S.
    → e.g. the pattern /cat|dog/ matches either the string cat or the string dog (the pipe symbol, |, expresses the disjunction); note that the pipe belongs outside square braces, since /[cat|dog]/ would match only a single character from the set c, a, t, |, d, o, g
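
A minimal sketch in Python's re module illustrating these points (the sample strings are invented for illustration):

  import re

  text = "Sam saw a snake"

  # /s/ is case sensitive: it matches only the lower-case occurrences
  print(re.findall(r"s", text))     # ['s', 's']

  # /[sS]/ is a disjunction over two characters, so both cases match
  print(re.findall(r"[sS]", text))  # ['S', 's', 's']

  # Multi-character alternatives need the pipe, not square braces
  print(re.findall(r"cat|dog", "the cat chased the dog"))  # ['cat', 'dog']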

 

Regular expressions provide a very flexible way of specifying such patterns and transformations:

  1. Disjunctions
    : e.g. r"[Ss]et", r"Set|set", r"[S-s]et" → Find both "Set" and "set"
  2. Negation
    : e.g. r"[^0-9]" → Find characters excluding the numbers from 0 to 9
  3. Optionality
    : e.g. r"beg.n" # . means match anything (wildcard expression)
              → "begin began begun beginning" returns to "X X X Xning"
    : e.g. r"colou?r" # ? means previous character is optional
    : e.g. r"w.*" # * is the Kleene star, meaning match 0 or more of previous char
              → Greedy matching: the pattern matches a literal w and then grabs everything (.*) after it,
                   essentially deleting the rest of the line when used in a substitution.
    : e.g. r"w.*?" # the ? makes the match non-greedy, so it consumes as little as possible (here, just the w)
    : e.g. r"fooo+" # + is the Kleene plus, meaning match 1 or more of previous char
    → When a pattern ends with *, regular expressions always match the largest string they can; patterns are greedy, expanding to cover as much of a string as possible. Non-greedy matching can be enforced with another meaning of the ? qualifier: the *? and +? operators match as little text as possible.
  4. Aliases
    - \d: any digit → [0-9]
    - \D: any non-digit → [^0-9]
    - \w: any alphanumeric/underscore → [a-zA-Z0-9_]
    - \W: a non-alphanumeric → [^\w]
    - \s: whitespace (space, tab, newline, etc.) → [ \r\t\n\f]
    - \S: non-whitespace → [^\s]
  5. Anchors
    : e.g. re.sub(r'\w+', "", text) # delete all words
    : e.g. re.sub(r'^\w+', "", text, flags=re.MULTILINE)
              # ^ anchors the match at the start of a string, deleting only the word there
              # switch on multiline mode to delete the word at the start of each line
    : e.g. re.sub(r'\W$', "", text, flags=re.MULTILINE) # use $ to anchor the match at the end of a string (or of each line in multiline mode)
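
The greedy/non-greedy and anchoring behaviour above can be checked with a short runnable sketch (the sample strings are invented for illustration):

  import re

  line = "wow, what a wonderful world"

  # Greedy: w.* matches from the first literal w to the end of the line
  print(re.search(r"w.*", line).group())   # 'wow, what a wonderful world'

  # Non-greedy: w.*? consumes as little as possible after the w
  print(re.search(r"w.*?", line).group())  # 'w'

  text = "one line\ntwo lines\nthree lines"

  # Without re.MULTILINE, ^ anchors only at the start of the whole string
  print(re.sub(r"^\w+", "X", text))                      # 'X line\ntwo lines\nthree lines'

  # With re.MULTILINE, ^ anchors at the start of every line
  print(re.sub(r"^\w+", "X", text, flags=re.MULTILINE))  # 'X line\nX lines\nX lines'

  # Likewise, $ anchors at the end of each line in multiline mode
  print(re.sub(r"\w+$", "X", text, flags=re.MULTILINE))  # 'one X\ntwo X\nthree X'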

Operator precedence

  1. Parenthesis: ()
  2. Counters: * + ? {}
  3. Sequences and anchors: the ^my end$
  4. Disjunction: |

Thus,

  • Because counters have a higher precedence than sequences, /the*/ matches theeee but not thethe.
  • Because sequences have a higher precedence than disjunction, /the|any/ matches the or any but not thany or theny.
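
Both precedence facts are easy to verify with Python's re module:

  import re

  # Counters bind tighter than sequences: /the*/ is th(e*), not (the)*
  print(re.fullmatch(r"the*", "theeee") is not None)  # True
  print(re.fullmatch(r"the*", "thethe") is not None)  # False

  # Sequences bind tighter than disjunction: /the|any/ is (the)|(any)
  print(re.fullmatch(r"the|any", "the") is not None)    # True
  print(re.fullmatch(r"the|any", "thany") is not None)  # False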

 

Stop words

  • High-frequency words, little lexical content (e.g. and, the, of)
  • When used for certain ML tasks can add 'noise' (but not always)
  • Filter out beforehand
  • Language specific, no universal list
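
A minimal filtering sketch using NLTK's English stop word list (assumes the list has been fetched once with nltk.download('stopwords')):

  from nltk.corpus import stopwords

  stops = set(stopwords.words('english'))  # language specific, no universal list

  tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
  content = [t for t in tokens if t not in stops]
  print(content)  # ['cat', 'sat', 'mat']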

Text corpora

  1. Gutenberg corpus (from nltk.corpus import gutenberg)
    : Electronic text archive, containing free electronic books → Isolated
  2. Web & chat (from nltk.corpus import nps_chat)
    : Samples of less formal language
  3. Brown corpus (from nltk.corpus import brown)
    : Text from 500 sources, categorised by genre → Categorised
  4. Reuters corpus (from nltk.corpus import reuters)
    : 10,788 news docs, tagged with various topics/industries etc. → Overlapping
  5. Inaugural address corpus (from nltk.corpus import inaugural)
    : 55 texts, one for each US presidential inaugural address → Temporal
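
Each corpus is exposed through a reader with a common interface; a brief sketch (assumes the corpora have been fetched with nltk.download):

  from nltk.corpus import gutenberg, brown

  # Every corpus reader lists its documents via fileids()
  print(gutenberg.fileids()[:2])  # e.g. ['austen-emma.txt', 'austen-persuasion.txt']

  # The Brown corpus is additionally categorised by genre
  print(brown.categories()[:4])             # e.g. ['adventure', 'belles_lettres', ...]
  print(brown.words(categories='news')[:5]) # first five tokens of the news genre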

 

Edit distance is a metric that measures how similar two strings are based on the number of edits (insertions, deletions, substitutions) it takes to change one string into the other.

  • Edit distance has applications throughout language processing, from spelling correction to speech recognition to coreference resolution.
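
Edit distance is standardly computed with dynamic programming; a minimal sketch of the Levenshtein variant, in which insertions, deletions and substitutions all cost 1:

  def edit_distance(source, target):
      # D[i][j] = edit distance between source[:i] and target[:j]
      n, m = len(source), len(target)
      D = [[0] * (m + 1) for _ in range(n + 1)]
      for i in range(n + 1):
          D[i][0] = i  # delete every character of the source prefix
      for j in range(m + 1):
          D[0][j] = j  # insert every character of the target prefix
      for i in range(1, n + 1):
          for j in range(1, m + 1):
              sub = 0 if source[i - 1] == target[j - 1] else 1
              D[i][j] = min(D[i - 1][j] + 1,        # deletion
                            D[i][j - 1] + 1,        # insertion
                            D[i - 1][j - 1] + sub)  # substitution (or copy)
      return D[n][m]

  print(edit_distance('intention', 'execution'))  # 5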

 


Text is often referred to as unstructured data.

  • Text has plenty of structure, but it is linguistic structure, intended for human consumption rather than for computers.

 

Text may contain synonyms (multiple words with the same meaning) and homographs (one spelling shared among multiple words with different meanings).

 

People write ungrammatically, misspell words, run words together, abbreviate unpredictably, and punctuate randomly.

 

Because text is intended for communication between people, context is important.

 

The general strategy in text mining is to use the simplest technique that works.

  • A document is composed of individual tokens or terms.
  • A collection of documents is called a corpus.

 

Language is ambiguous. To determine structure, we must resolve ambiguity.

  • Processing text data:
    • Lexical analysis (tokenisation)
      : Tokenisation is the task of chopping a document up into pieces, called tokens.
    • Stop word removal
      : A stopword is a very common word in English. The words the, and, of and on are considered stopwords so they are typically removed.
    • Stemming
      : Suffixes are removed so that verbs like announces, announced and announcing are all reduced to the term announc. Stemming also transforms noun plurals to the singular form, so directors becomes director (see the sketch after this list).
    • Lemmatisation
      : A lemma is the canonical form, dictionary form, or citation form of a set of word forms. For example, break, breaks, broke, broken and breaking are forms of the same lexeme, with break as the lemma by which they are indexed. Lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning.
    • Morphology (prefixes, suffixes, etc.)
  • The higher levels of ambiguity:
    • Syntax (part of speech tagging)
      • Ambiguity problem
    • Parsing (grammar)
    • Sentence boundary detection
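
A minimal sketch of the lexical steps using NLTK (assumes the required data packages, e.g. punkt and wordnet, have been downloaded):

  from nltk.tokenize import word_tokenize
  from nltk.stem import PorterStemmer, WordNetLemmatizer

  text = "The director announces that the directors announced new plans"

  tokens = word_tokenize(text)  # lexical analysis (tokenisation)

  stemmer = PorterStemmer()
  print([stemmer.stem(t) for t in tokens])
  # 'announces' and 'announced' both reduce to 'announc'; 'directors' -> 'director'

  lemmatizer = WordNetLemmatizer()
  print([lemmatizer.lemmatize(t, pos='v') for t in tokens])
  # treated as verbs, 'announces' and 'announced' both map to the lemma 'announce'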

 
