A regular expression (often shortened to regex) is a formal language for defining text strings (character sequences).

  • It is used for pattern matching (e.g. searching and replacing in text)

Formally, a regular expression is an algebraic notation for characterising a set of strings.

  • Regular expressions are particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search through.

The simplest kind of regular expression is a sequence of simple characters; putting characters in sequence is called concatenation.
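As a quick illustration (using Python's re module; the sample sentence here is invented), a concatenation of plain characters matches exactly that character sequence:

```python
import re

text = "How much wood would a woodchuck chuck?"

# The pattern /woodchuck/ is just the concatenation w,o,o,d,c,h,u,c,k.
match = re.search(r"woodchuck", text)
print(match.group())   # -> woodchuck
print(match.start())   # index in text where the match begins
```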

 

Regular expressions are case sensitive; lower case /s/ is distinct from upper case /S/.

  • /s/ matches a lower case s but not an upper case S.
  • We can solve this problem with the use of the square braces [ and ].
    → The string of characters inside the braces specifies a disjunction of characters to match.
    → e.g. the pattern /[sS]/ matches strings containing either s or S.
    → e.g. the pattern /cat|dog/ matches either the string cat or the string dog (the pipe symbol, |, expresses the disjunction; it must appear outside square braces, since /[cat|dog]/ would just be a character class over the letters c, a, t, |, d, o, g)
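A minimal sketch of the difference between character-class disjunction and string disjunction (the sample strings are made up for illustration):

```python
import re

# Character-class disjunction: [sS] matches ONE character, s or S.
print(re.findall(r"[sS]et", "Set the set"))    # -> ['Set', 'set']

# String disjunction needs the pipe OUTSIDE square braces.
print(re.findall(r"cat|dog", "cat and dog"))   # -> ['cat', 'dog']

# Inside braces, | is a literal character: [cat|dog] is just a
# character class over c, a, t, |, d, o, g.
print(re.findall(r"[cat|dog]", "c|g"))         # -> ['c', '|', 'g']
```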

 

Regular expressions provide a very flexible toolkit for matching and transforming text:

  1. Disjunctions
    : e.g. r"[Ss]et", r"Set|set" → Find both "Set" and "set" (the range r"[S-s]et" also matches both, but it additionally matches the ASCII characters between S and s, so it is best avoided)
  2. Negation
    : e.g. r"[^0-9]" → Find characters excluding the numbers from 0 to 9
  3. Optionality
    : e.g. r"beg.n" # . means match any single character (wildcard expression)
              → substituting each match with "X" in "begin began begun beginning" yields "X X X Xning"
    : e.g. r"colou?r" # ? means previous character is optional
    : e.g. r"\w.*" # * is the Kleene star, meaning match 0 or more of the previous expression
              → Greedy matching: it finds a word character (\w)
                   and then grabs everything (.*) after it, essentially consuming the rest of the line.
    : e.g. r"\w.*?" # appending ? makes the match non-greedy
    : e.g. r"fooo+" # + is the Kleene plus, meaning match 1 or more of previous char
    → In the case of ending with *, regular expressions always match the largest string they can; patterns are greedy, expanding to cover as much of a string as they can. However, there are ways to enforce non-greedy matching, using another meaning of the ? qualifier: the *? and +? operators are non-greedy versions of the Kleene star and Kleene plus that match as little text as possible.
  4. Aliases
    - \d: any digit → [0-9]
    - \D: any non-digit → [^0-9]
    - \w: any alphanumeric/underscore → [a-zA-Z0-9_]
    - \W: a non-alphanumeric → [^\w]
    - \s: whitespace (space, tab, newline, etc.) → [ \t\n\r\f\v]
    - \S: non-whitespace → [^\s]
  5. Anchors
    : e.g. re.sub(r'\w+', "", text) # delete all words
    : e.g. re.sub(r'^\w+', "", text, flags=re.MULTILINE)
              # delete only words at the start of a string
              # switch on multiline mode to delete words at the start of each line

    : e.g. re.sub(r'\W$', "", text, flags=re.MULTILINE) # use $ to anchor the match at the end of a string
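The behaviours above can be checked directly in Python; the sample strings below are invented for illustration:

```python
import re

# Optionality: ? makes the previous character optional.
print(re.findall(r"colou?r", "colour color"))   # -> ['colour', 'color']

# Greedy vs non-greedy matching.
line = "well, that was fun"
print(re.findall(r"\w.*", line))    # greedy: -> ['well, that was fun']
print(re.sub(r"\w.*", "", line))    # deletes the entire line -> ''

# Aliases: \d matches digits.
print(re.findall(r"\d+", "room 101, floor 7"))  # -> ['101', '7']

# Anchors with multiline mode: ^ matches at the start of each line.
verse = "one fish\ntwo fish"
print(re.sub(r"^\w+", "X", verse, flags=re.MULTILINE))  # -> X fish\nX fish
```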

Operator precedence

  1. Parenthesis: ()
  2. Counters: * + ? {}
  3. Sequences and anchors: e.g. the, ^my, end$
  4. Disjunction: |

Thus,

  • Because counters have a higher precedence than sequences, /the*/ matches theeee but not thethe.
  • Because sequences have a higher precedence than disjunction, /the|any/ matches the or any but not thany or theny.
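These precedence rules can be confirmed with a quick check (the test strings are invented here):

```python
import re

# Counters bind tighter than sequences: /the*/ parses as th(e*).
print(re.fullmatch(r"the*", "theeee") is not None)   # -> True
print(re.fullmatch(r"the*", "thethe") is not None)   # -> False
print(re.fullmatch(r"the*", "th") is not None)       # -> True (zero e's)

# Sequences bind tighter than disjunction: /the|any/ parses as (the)|(any).
print(re.fullmatch(r"the|any", "the") is not None)   # -> True
print(re.fullmatch(r"the|any", "thany") is not None) # -> False
```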

 

Stop words

  • High-frequency words, little lexical content (e.g. and, the, of)
  • When used for certain ML tasks can add 'noise' (but not always)
  • Filter out beforehand
  • Language specific, no universal list
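A minimal sketch of the filtering step; the stop-word list below is a small invented subset (in practice a fuller, language-specific list such as NLTK's would be used):

```python
# Illustrative subset of English stop words (not a complete list).
STOP_WORDS = {"the", "of", "and", "a", "an", "in", "to", "is"}

def remove_stop_words(tokens):
    """Filter out high-frequency, low-content words before an ML task."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the cat sat in the garden of an old house".split()
print(remove_stop_words(tokens))  # -> ['cat', 'sat', 'garden', 'old', 'house']
```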

Text corpora

  1. Gutenberg corpus (from nltk.corpus import gutenberg)
    : Electronic text archive, containing free electronic books → Isolated
  2. Web & chat (from nltk.corpus import nps_chat)
    : Samples of less formal language
  3. Brown corpus (from nltk.corpus import brown)
    : Text from 500 sources, categorised by genre → Categorised
  4. Reuters corpus (from nltk.corpus import reuters)
    : 10,788 news docs, tagged with various topics/industries etc. → Overlapping
  5. Inaugural address corpus (from nltk.corpus import inaugural)
    : 55 texts, one for each US presidential inaugural address → Temporal

 

Edit distance is a metric that measures how similar two strings are based on the number of edits (insertions, deletions, substitutions) it takes to change one string into the other.

  • Edit distance is an algorithm with applications throughout language processing, from spelling correction to speech recognition to coreference resolution.
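The metric can be computed with the standard dynamic-programming algorithm; a minimal sketch, assuming unit cost for each insertion, deletion and substitution (Levenshtein distance):

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,              # deletion
                           dp[i][j - 1] + 1,              # insertion
                           dp[i - 1][j - 1] + sub_cost)   # substitution
    return dp[m][n]

print(edit_distance("intention", "execution"))  # -> 5
```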

 
