A regular expression (often shortened to regex) is a formal language for defining text strings (character sequences).
- It is used for pattern matching (e.g. searching and replacing in text)
Formally, a regular expression is an algebraic notation for characterising a set of strings.
- Regular expressions are particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search through.
The simplest kind of regular expression is a sequence of simple characters; putting characters in sequence is called concatenation.
Regular expressions are case sensitive; lower case /s/ is distinct from upper case /S/.
- /s/ matches a lower case s but not an upper case S.
- We can solve this problem with the square brackets [ and ].
→ The string of characters inside the brackets specifies a disjunction of characters to match.
→ e.g. the pattern /[sS]/ matches patterns containing either s or S.
→ e.g. the pattern /(cat|dog)/ matches either the string cat or the string dog; the pipe symbol, |, expresses the disjunction, and parentheses (not square brackets) group the alternatives.
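The difference between a character class and an alternation can be sketched with Python's re module (a minimal illustration; the example strings are made up):

```python
import re

# A character class matches exactly one character from the set
print(re.findall(r"[sS]", "Sam sees"))  # ['S', 's', 's']

# Alternation over whole strings needs (...) with |, not brackets:
# r"[cat|dog]" would be a character class matching any one of c, a, t, |, d, o, g
print(re.findall(r"(?:cat|dog)", "the cat chased the dog"))  # ['cat', 'dog']
```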
Regular expressions provide a very flexible toolkit for these kinds of matching and transformation tasks:
- Disjunctions
: e.g. r"[Ss]et", r"Set|set" → Find both "Set" and "set"
(r"[S-s]et" also matches both, but the ASCII range S-s additionally covers characters such as ^ and _, so it is broader than intended)
- Negation
: e.g. r"[^0-9]" → Find characters excluding the digits 0 to 9
- Optionality and wildcards
: e.g. r"beg.n" # . means match anything (wildcard expression)
→ substituting "X" for matches of r"beg.n" in "begin began begun beginning" gives "X X X Xning"
: e.g. r"colou?r" # ? means the previous character is optional
: e.g. r"w.*" # * is the Kleene star, meaning match 0 or more of previous char
→ Greedy matching: r"w.*" matches a literal w and then grabs everything after it (.*) up to the end of the line, so substituting "" for it deletes from the first w onwards.
: e.g. r"w.*?" # make sure the match is non-greedy using the ? character
: e.g. r"fooo+" # + is the Kleene plus, meaning match 1 or more of previous char
→ When a pattern ends with *, regular expressions always match the largest string they can; patterns are greedy, expanding to cover as much of a string as they can. Non-greedy matching can be enforced using another meaning of the ? qualifier: the *? and +? operators match as little text as possible.
- Aliases
- \d: any digit → [0-9]
- \D: any non-digit → [^0-9]
- \w: any alphanumeric/underscore → [a-zA-Z0-9_]
- \W: a non-alphanumeric → [^\w]
- \s: whitespace (space, tab, newline, etc.) → [ \t\r\n\f]
- \S: non-whitespace → [^\s]
- Anchors
: e.g. re.sub(r'\w+', "", text) # delete all words
: e.g. re.sub(r'^\w+', "", text, flags=re.MULTILINE)
# ^ anchors the match at the start of a string
# multiline mode makes ^ also match at the start of each line
: e.g. re.sub(r'\W$', "", text, flags=re.MULTILINE) # use $ to anchor the match at the end of a string/line
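The greedy/non-greedy and anchoring behaviour above can be checked with a short sketch (the example strings are made up):

```python
import re

# Greedy vs non-greedy: * takes as much as it can, *? as little
assert re.search(r"w.*", "www").group() == "www"   # greedy: all three chars
assert re.search(r"w.*?", "www").group() == "w"    # non-greedy: just the w

text = "one two\nthree four"

# ^\w+ with MULTILINE deletes the word at the start of each line
print(re.sub(r"^\w+", "", text, flags=re.MULTILINE))  # ' two\n four'

# \w+$ deletes the word at the end of each line
print(re.sub(r"\w+$", "", text, flags=re.MULTILINE))  # 'one \nthree '
```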
Operator precedence
- Parentheses: ()
- Counters: * + ? {}
- Sequences and anchors: the ^my end$
- Disjunction: |
Thus,
- Because counters have a higher precedence than sequences, /the*/ matches theeee but not thethe.
- Because sequences have a higher precedence than disjunction, /the|any/ matches the or any but not thany or theny.
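Both precedence facts can be verified with re.fullmatch (a small sketch):

```python
import re

# Counters bind tighter than sequences: the* parses as th(e*)
assert re.fullmatch(r"the*", "theeee") is not None
assert re.fullmatch(r"the*", "thethe") is None

# Sequences bind tighter than disjunction: the|any parses as (the)|(any)
assert re.fullmatch(r"the|any", "the") is not None
assert re.fullmatch(r"the|any", "any") is not None
assert re.fullmatch(r"the|any", "thany") is None
```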
Stop words
- High-frequency words, little lexical content (e.g. and, the, of)
- For certain ML tasks they can add 'noise' (but not always)
- Often filtered out beforehand
- Language specific, no universal list
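A minimal sketch of the filtering step; the stop-word list here is a tiny made-up sample, not NLTK's fuller, language-specific one (nltk.corpus.stopwords):

```python
# Tiny illustrative stop-word list (real lists are longer and language specific)
STOP_WORDS = {"and", "the", "of", "a", "to", "in"}

def remove_stop_words(tokens):
    """Drop high-frequency, low-content words before downstream tasks."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("The cat sat of the mat".split()))  # ['cat', 'sat', 'mat']
```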
Text corpora
- Gutenberg corpus (from nltk.corpus import gutenberg)
: Electronic text archive containing free electronic books → Isolated
- Web & chat (from nltk.corpus import nps_chat)
: Samples of less formal language
- Brown corpus (from nltk.corpus import brown)
: Text from 500 sources, categorised by genre → Categorised
- Reuters corpus (from nltk.corpus import reuters)
: 10,788 news docs, tagged with various topics/industries etc. → Overlapping
- Inaugural address corpus (from nltk.corpus import inaugural)
: 55 texts, one for each US presidential inaugural address → Temporal
Edit distance is a metric that measures how similar two strings are based on the number of edits (insertions, deletions, substitutions) it takes to change one string into the other.
- Edit distance is an algorithm with applications throughout language processing, from spelling correction to speech recognition to coreference resolution.
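A standard way to compute it is dynamic programming over string prefixes; a sketch with unit cost for all three edit operations (some formulations instead give substitution a cost of 2):

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions (each cost 1) needed to turn s into t."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))  # row i-1: distances from s[:i-1] to t[:j]
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + sub)  # substitution (free if equal)
        prev = curr
    return prev[n]

print(edit_distance("kitten", "sitting"))  # 3
```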