A regular expression (often shortened to regex) is a formal language for defining text strings (character sequences).

  • It is used for pattern matching (e.g. searching and replacing in text)

Formally, a regular expression is an algebraic notation for characterising a set of strings.

  • Regular expressions are particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search through.

The simplest kind of regular expression is a sequence of simple characters; putting characters in sequence is called concatenation.
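As a quick illustration (using Python's re module; the sample sentence here is invented), a concatenation of plain characters matches exactly that character sequence:

```python
import re

text = "How much wood would a woodchuck chuck?"

# The pattern /woodchuck/ is just the concatenation w,o,o,d,c,h,u,c,k.
match = re.search(r"woodchuck", text)
print(match.group())   # -> woodchuck
print(match.start())   # index in text where the match begins
```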

 

Regular expressions are case sensitive; lower case /s/ is distinct from upper case /S/.

  • /s/ matches a lower case s but not an upper case S.
  • We can solve this problem with the use of the square braces [ and ].
    → The string of characters inside the braces specifies a disjunction of characters to match.
    → e.g. the pattern /[sS]/ matches strings containing either s or S.
    → e.g. the pattern /cat|dog/ matches either the string cat or the string dog (the pipe symbol, |, expresses the disjunction; it must appear outside square braces, since /[cat|dog]/ would just be a character class over the letters c, a, t, |, d, o, g)
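A minimal sketch of the difference between character-class disjunction and string disjunction (the sample strings are made up for illustration):

```python
import re

# Character-class disjunction: [sS] matches ONE character, s or S.
print(re.findall(r"[sS]et", "Set the set"))    # -> ['Set', 'set']

# String disjunction needs the pipe OUTSIDE square braces.
print(re.findall(r"cat|dog", "cat and dog"))   # -> ['cat', 'dog']

# Inside braces, | is a literal character: [cat|dog] is just a
# character class over c, a, t, |, d, o, g.
print(re.findall(r"[cat|dog]", "c|g"))         # -> ['c', '|', 'g']
```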

 

Regular expressions provide a very flexible toolkit for matching and transforming text:

  1. Disjunctions
    : e.g. r"[Ss]et", r"Set|set" → Find both "Set" and "set" (the range r"[S-s]et" also matches both, but it additionally matches the ASCII characters between S and s, so it is best avoided)
  2. Negation
    : e.g. r"[^0-9]" → Find characters excluding the numbers from 0 to 9
  3. Optionality
    : e.g. r"beg.n" # . means match any single character (wildcard expression)
              → substituting each match with "X" in "begin began begun beginning" yields "X X X Xning"
    : e.g. r"colou?r" # ? means previous character is optional
    : e.g. r"\w.*" # * is the Kleene star, meaning match 0 or more of the previous expression
              → Greedy matching: it finds a word character (\w)
                   and then grabs everything (.*) after it, essentially consuming the rest of the line.
    : e.g. r"\w.*?" # appending ? makes the match non-greedy
    : e.g. r"fooo+" # + is the Kleene plus, meaning match 1 or more of previous char
    → In the case of ending with *, regular expressions always match the largest string they can; patterns are greedy, expanding to cover as much of a string as they can. However, there are ways to enforce non-greedy matching, using another meaning of the ? qualifier: the *? and +? operators are non-greedy versions of the Kleene star and Kleene plus that match as little text as possible.
  4. Aliases
    - \d: any digit → [0-9]
    - \D: any non-digit → [^0-9]
    - \w: any alphanumeric/underscore → [a-zA-Z0-9_]
    - \W: a non-alphanumeric → [^\w]
    - \s: whitespace (space, tab, newline, etc.) → [ \t\n\r\f\v]
    - \S: non-whitespace → [^\s]
  5. Anchors
    : e.g. re.sub(r'\w+', "", text) # delete all words
    : e.g. re.sub(r'^\w+', "", text, flags=re.MULTILINE)
              # delete only words at the start of a string
              # switch on multiline mode to delete words at the start of each line

    : e.g. re.sub(r'\W$', "", text, flags=re.MULTILINE) # use $ to anchor the match at the end of a string
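The behaviours above can be checked directly in Python; the sample strings below are invented for illustration:

```python
import re

# Optionality: ? makes the previous character optional.
print(re.findall(r"colou?r", "colour color"))   # -> ['colour', 'color']

# Greedy vs non-greedy matching.
line = "well, that was fun"
print(re.findall(r"\w.*", line))    # greedy: -> ['well, that was fun']
print(re.sub(r"\w.*", "", line))    # deletes the entire line -> ''

# Aliases: \d matches digits.
print(re.findall(r"\d+", "room 101, floor 7"))  # -> ['101', '7']

# Anchors with multiline mode: ^ matches at the start of each line.
verse = "one fish\ntwo fish"
print(re.sub(r"^\w+", "X", verse, flags=re.MULTILINE))  # -> X fish\nX fish
```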

Operator precedence

  1. Parenthesis: ()
  2. Counters: * + ? {}
  3. Sequences and anchors: e.g. the, ^my, end$
  4. Disjunction: |

Thus,

  • Because counters have a higher precedence than sequences, /the*/ matches theeee but not thethe.
  • Because sequences have a higher precedence than disjunction, /the|any/ matches the or any but not thany or theny.
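These precedence rules can be confirmed with a quick check (the test strings are invented here):

```python
import re

# Counters bind tighter than sequences: /the*/ parses as th(e*).
print(re.fullmatch(r"the*", "theeee") is not None)   # -> True
print(re.fullmatch(r"the*", "thethe") is not None)   # -> False
print(re.fullmatch(r"the*", "th") is not None)       # -> True (zero e's)

# Sequences bind tighter than disjunction: /the|any/ parses as (the)|(any).
print(re.fullmatch(r"the|any", "the") is not None)   # -> True
print(re.fullmatch(r"the|any", "thany") is not None) # -> False
```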

 

Stop words

  • High-frequency words, little lexical content (e.g. and, the, of)
  • When used for certain ML tasks can add 'noise' (but not always)
  • Filter out beforehand
  • Language specific, no universal list
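A minimal sketch of the filtering step; the stop-word list below is a small invented subset (in practice a fuller, language-specific list such as NLTK's would be used):

```python
# Illustrative subset of English stop words (not a complete list).
STOP_WORDS = {"the", "of", "and", "a", "an", "in", "to", "is"}

def remove_stop_words(tokens):
    """Filter out high-frequency, low-content words before an ML task."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the cat sat in the garden of an old house".split()
print(remove_stop_words(tokens))  # -> ['cat', 'sat', 'garden', 'old', 'house']
```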

Text corpora

  1. Gutenberg corpus (from nltk.corpus import gutenberg)
    : Electronic text archive, containing free electronic books → Isolated
  2. Web & chat (from nltk.corpus import nps_chat)
    : Samples of less formal language
  3. Brown corpus (from nltk.corpus import brown)
    : Text from 500 sources, categorised by genre → Categorised
  4. Reuters corpus (from nltk.corpus import reuters)
    : 10,788 news docs, tagged with various topics/industries etc. → Overlapping
  5. Inaugural address corpus (from nltk.corpus import inaugural)
    : 55 texts, one for each US presidential inaugural address → Temporal

 

Edit distance is a metric that measures how similar two strings are based on the number of edits (insertions, deletions, substitutions) it takes to change one string into the other.

  • Edit distance is an algorithm with applications throughout language processing, from spelling correction to speech recognition to coreference resolution.
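The metric can be computed with the standard dynamic-programming algorithm; a minimal sketch, assuming unit cost for each insertion, deletion and substitution (Levenshtein distance):

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,              # deletion
                           dp[i][j - 1] + 1,              # insertion
                           dp[i - 1][j - 1] + sub_cost)   # substitution
    return dp[m][n]

print(edit_distance("intention", "execution"))  # -> 5
```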

 
