Regular expressions (often shortened to regex) are a formal language for specifying text strings (character sequences).

  • It is used for pattern matching (e.g. searching and replacing in text)

Formally, a regular expression is an algebraic notation for characterising a set of strings.

  • Regular expressions are particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search through.

The simplest kind of regular expression is a sequence of simple characters; putting characters in sequence is called concatenation.

 

Regular expressions are case sensitive; lower case /s/ is distinct from upper case /S/.

  • /s/ matches a lower case s but not an upper case S.
  • We can solve this problem with the use of the square brackets [ and ].
    → The string of characters inside the brackets specifies a disjunction of characters to match.
    → e.g. the pattern /[sS]/ matches strings containing either s or S.
    → To match either the string cat or the string dog, use the pipe symbol instead: /cat|dog/. Inside brackets the pipe is a literal character, so /[cat|dog]/ matches only a single c, a, t, |, d, o or g.
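The difference between a character class and the pipe can be checked with Python's built-in re module. This is a small sketch; the sample sentence and variable names are made up for illustration.

```python
import re

text = "Sam saw a cat and a dog."  # hypothetical sample sentence

# [sS] is a character class: it matches exactly one character, s or S.
single_chars = re.findall(r"[sS]", text)

# cat|dog is a disjunction between two whole character sequences.
whole_words = re.findall(r"cat|dog", text)

# Inside brackets the pipe is literal, so [cat|dog] matches one character at a time.
bracketed = re.findall(r"[cat|dog]", "cod|")

print(single_chars)  # ['S', 's']
print(whole_words)   # ['cat', 'dog']
print(bracketed)     # ['c', 'o', 'd', '|']
```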

 

Regular expressions provide a very flexible way of doing these transformations:

  1. Disjunctions
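    : e.g. r"[Ss]et", r"Set|set" → Find both "Set" and "set"
      (r"[S-s]et" also finds both, but the range S-s covers every character from S to s, so it matches strings such as "Tet" and "aet" as well)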
  2. Negation
    : e.g. r"[^0-9]" → Find any single character that is not a digit from 0 to 9
  3. Optionality
    : e.g. r"beg.n" # . is a wildcard that matches any single character (except a newline)
              → substituting "X" in "begin began begun beginning" gives "X X X Xning"
    : e.g. r"colou?r" # ? means the previous character is optional
    : e.g. r"\w.*" # * is the Kleene star, meaning match 0 or more of the previous character
              → Greedy matching: it searches for a pattern that begins with any word character (\w)
                   and then grabs everything (.*) after it, essentially deleting the entire line.
    : e.g. r"\w.*?" # the ? after * makes the match non-greedy
    : e.g. r"fooo+" # + is the Kleene plus, meaning match 1 or more of the previous character
    → By default, the * and + operators are greedy: they always match the largest string they can, expanding to cover as much of the text as possible. Non-greedy matching can be enforced with another meaning of the ? qualifier: the *? and +? operators are versions of the Kleene star and Kleene plus that match as little text as possible.
  4. Aliases
    - \d: any digit → [0-9]
    - \D: any non-digit → [^0-9]
    - \w: any alphanumeric/underscore → [a-zA-Z0-9_]
    - \W: a non-alphanumeric → [^\w]
    - \s: whitespace (space, tab) → [ \r\t\n\f]
    - \S: non-whitespace → [^\s]
  5. Anchors
    : e.g. re.sub(r'\w+', "", text) # delete all words
    : e.g. re.sub(r'^\w+', "", text)
              # delete only the word at the start of the string
    : e.g. re.sub(r'^\w+', "", text, flags=re.MULTILINE)
              # switch on multiline mode to delete the word at the start of each line

    : e.g. re.sub(r'\W$', "", text, flags=re.MULTILINE) # use $ to anchor the match at the end of a string (or, in multiline mode, of each line)
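Greedy versus non-greedy matching and the effect of the anchors can be seen in a short Python sketch. The sample strings and variable names are assumptions chosen for illustration, and the last call uses \w+$ (a whole word at the end of a line) rather than \W$.

```python
import re

# Greedy versus non-greedy: the same text, with and without the trailing ?.
line = "hello world"
greedy = re.sub(r"\w.*", "", line)   # \w, then .* swallows the rest of the line
lazy = re.sub(r"\w.*?", "", line)    # .*? matches nothing, so single word characters are removed

# Anchors: ^ and $ match at every line boundary only when re.MULTILINE is set.
text = "first line here\nsecond line there"
start_of_string = re.sub(r"^\w+", "", text)
start_of_lines = re.sub(r"^\w+", "", text, flags=re.MULTILINE)
end_of_lines = re.sub(r"\w+$", "", text, flags=re.MULTILINE)

print(repr(greedy))           # ''
print(repr(lazy))             # ' '
print(repr(start_of_string))  # ' line here\nsecond line there'
print(repr(start_of_lines))   # ' line here\n line there'
print(repr(end_of_lines))     # 'first line \nsecond line '
```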

Operator precedence

  1. Parenthesis: ()
  2. Counters: * + ? {}
  3. Sequences and anchors: the ^my end$
  4. Disjunction: |

Thus,

  • Because counters have a higher precedence than sequences, /the*/ matches theeee but not thethe.
  • Because sequences have a higher precedence than disjunction, /the|any/ matches the or any but not thany or theny.
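These two precedence claims can be checked in Python. The sketch below assumes re.fullmatch, so that the whole string has to match the pattern; with re.search, /the*/ would still find the substring the inside thethe.

```python
import re

def matches(pattern, string):
    """Return True when the whole string matches the pattern."""
    return re.fullmatch(pattern, string) is not None

# Counters bind tighter than sequences: * applies to the final e only.
star_on_e = matches(r"the*", "theeee")         # True
star_not_on_the = matches(r"the*", "thethe")   # False
grouped = matches(r"(the)*", "thethe")         # True: parentheses have the highest precedence

# Sequences bind tighter than disjunction: the alternatives are the and any.
either = [matches(r"the|any", s) for s in ("the", "any", "thany", "theny")]

print(star_on_e, star_not_on_the, grouped)  # True False True
print(either)                               # [True, True, False, False]
```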

 

Stop words

  • High-frequency words, little lexical content (e.g. and, the, of)
  • When used for certain ML tasks can add 'noise' (but not always)
  • Filter out beforehand
  • Language specific, no universal list
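Filtering stop words beforehand might look like the sketch below. The five-word stop list and the helper name remove_stop_words are hypothetical; in practice a language-specific list would be supplied.

```python
# Hypothetical, deliberately tiny English stop list for illustration only.
STOP_WORDS = {"and", "the", "of", "a", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lower-cased form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "The history of the city and the river".split()
filtered = remove_stop_words(tokens)

print(filtered)  # ['history', 'city', 'river']
```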

Text corpora

  1. Gutenberg corpus (from nltk.corpus import gutenberg)
    : Electronic text archive, containing free electronic books → Isolated
  2. Web & chat (from nltk.corpus import nps_chat)
    : Samples of less formal language
  3. Brown corpus (from nltk.corpus import brown)
    : Text from 500 sources, categorised by genre → Categorised
  4. Reuters corpus (from nltk.corpus import reuters)
    : 10,788 news docs, tagged with various topics/industries etc. → Overlapping
  5. Inaugural address corpus (from nltk.corpus import inaugural)
    : 55 texts, one for each US presidential address → Temporal

 

Edit distance is a metric that measures how similar two strings are based on the number of edits (insertions, deletions, substitutions) it takes to change one string into the other.

  • Edit distance is an algorithm with applications throughout language processing, from spelling correction to speech recognition to coreference resolution.
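One common way to compute this metric is dynamic programming over a table of prefix distances. The sketch below assumes every insertion, deletion and substitution costs 1 (the Levenshtein variant); other cost schemes exist, and the function name edit_distance is illustrative.

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions and substitutions (each cost 1)."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] holds the distance between source[:i] and target[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of the source prefix
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # substitution or match
            )
    return dist[-1][-1]

print(edit_distance("kitten", "sitting"))  # 3
```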

 


A simple scalar example of audio processing:

  • Amplitude is on the y-axis.
  • Time is on the x-axis.

Normalisation in audio signals allows us to adjust the volume (amplitude) of the entire signal.

  • We can change the size of the amplitude in a proportionate way.

Normalisation in audio signals is a bit simpler than statistical normalisation. It involves two phases: analysis and scaling.

  1. Analysis phase
    : In this phase, the signal is analysed to find the peak, or the loudest sample. This is essentially a peak-finding algorithm that identifies the highest amplitude in the waveform.
  2. Scaling phase
    : Once the peak is found, the algorithm calculates how much gain can be applied to the entire signal without causing clipping (distortion). This gain is then applied uniformly to the entire signal.
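The two phases can be sketched in plain Python as follows. The assumptions here are that samples are floats in the range -1.0 to 1.0, that clipping starts above an absolute value of 1.0, and that the names normalise and target_peak are illustrative.

```python
def normalise(samples, target_peak=1.0):
    """Scale all samples by one gain so the loudest sample reaches target_peak."""
    # Analysis phase: find the loudest sample (largest absolute amplitude).
    peak = max((abs(sample) for sample in samples), default=0.0)
    if peak == 0:
        return list(samples)  # silence or empty input: nothing to scale
    # Scaling phase: the largest gain that does not push the peak past the target.
    gain = target_peak / peak
    return [sample * gain for sample in samples]

signal = [0.1, -0.25, 0.5, -0.2]  # hypothetical quiet signal
normalised = normalise(signal)

print(normalised)  # [0.2, -0.5, 1.0, -0.4]
```

Because the same gain is applied to every sample, the shape of the waveform is preserved and only its overall level changes.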

Linear ramps: fading in and out

  • Fade in
    : The scalar starts at zero, which mutes the signal, and increases gradually across the range until it reaches one. A scalar of one leaves the signal unchanged, so the output ends at the original level.
  • Fade out
    : The scalar starts at a high value (typically one) at the beginning of the array of samples being processed. As we iterate over the array, the scalar is reduced to zero, so the signal sounds as if it is getting quieter.
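Both ramps amount to multiplying each sample by a scalar that changes linearly with its position in the array. The sketch below assumes the ramp spans the whole array and runs between exactly 0 and 1; the function names are illustrative.

```python
def fade_in(samples):
    """Multiply samples by a scalar rising linearly from 0 to 1."""
    n = len(samples)
    if n < 2:
        return list(samples)  # too short to ramp
    return [sample * (i / (n - 1)) for i, sample in enumerate(samples)]

def fade_out(samples):
    """Multiply samples by a scalar falling linearly from 1 to 0."""
    n = len(samples)
    if n < 2:
        return list(samples)
    return [sample * (1 - i / (n - 1)) for i, sample in enumerate(samples)]

signal = [1.0] * 5  # hypothetical constant-amplitude signal
faded_in = fade_in(signal)
faded_out = fade_out(signal)

print(faded_in)   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(faded_out)  # [1.0, 0.75, 0.5, 0.25, 0.0]
```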

 


Text is often referred to as unstructured data.

  • Text has plenty of structure, but it is linguistic structure as it is intended for human consumption, not for computers.

 

Text may contain synonyms (multiple words with the same meaning) and homographs (one spelling shared among multiple words with different meanings).

 

People write ungrammatically, misspell words, run words together, abbreviate unpredictably, and punctuate randomly. Because text is intended for communication between people, context is important.

 

The general strategy in text mining is to use the simplest technique that works.

  • A document is composed of individual tokens or terms.
  • A collection of documents is called a corpus.

 

Language is ambiguous. To determine structure, we must resolve ambiguity.

  • Processing text data:
    • Lexical analysis (tokenisation)
      : Tokenisation is the task of chopping a text up into pieces, called tokens.
    • Stop word removal
      : A stop word is a very common word in a language. In English, the words the, and, of and on are considered stop words, so they are typically removed.
    • Stemming
      : Suffixes are removed so that verbs like announces, announced and announcing are all reduced to the term announc. Stemming also transforms noun plurals to the singular forms, so directors becomes director.
    • Lemmatisation
      : A lemma is the canonical form, dictionary form, or citation form of a set of word forms. For example, break, breaks, broke, broken and breaking are forms of the same lexeme, with break as the lemma by which they are indexed. Lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning.
    • Morphology (prefixes, suffixes, etc.)
      : Morphology is the study of words, including the principles by which they are formed, and how they relate to one another within a language.
  • The higher levels of ambiguity:
    • Syntax (part of speech tagging)
      • Ambiguity problem
    • Parsing (grammar)
    • Sentence boundary detection

 


NLP tasks can be categorised by problem type:

  • Classification
    • Sentiment classification
    • News categorisation
  • Regression
    • Essay scoring
  • Sequence labelling
    • Part of speech tagging, named entity recognition

How do we evaluate the models? Below is an example for a classification problem:

  • Imagine we are building a spam classifier
    : Predict whether email messages will be filtered or not
    • Input = feature matrix (email message)
    • Output = target vector (yes/no)
  • Model could be Naive Bayes, k-nearest neighbour, etc.
  • This is a binary classification problem

In this case, the goal is to predict 'spam' or 'not spam' for email messages. As before:

  1. Choose a class of model
  2. Set model hyperparameters
  3. Configure the data (X and y)
  4. Fit the model to the data
  5. Apply model to new (unseen) data

To measure performance, we should consider several factors, including

  • Metric(s)
    • These are quantitative measures that assess how well a model performs.
      → A common metric is accuracy, which is calculated as the number of correct predictions divided by the total number of predictions (n).
  • Balance of the dataset
    • This refers to the distribution of classes within your data.
      → An imbalanced dataset can skew the performance metrics, so it's important to consider this factor as well (for an unbalanced dataset, we can achieve high accuracy simply by selecting the majority class).
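The effect of class imbalance on accuracy is easy to reproduce. The sketch below uses made-up labels (5 spam messages out of 100) and a "model" that always predicts the majority class; both are assumptions for illustration.

```python
def accuracy(y_true, y_pred):
    """Number of correct predictions divided by the total number of predictions."""
    correct = sum(actual == predicted for actual, predicted in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical imbalanced dataset: 5 spam messages, 95 legitimate ones.
y_true = ["spam"] * 5 + ["not spam"] * 95
# A trivial baseline that always predicts the majority class.
y_pred = ["not spam"] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
print(baseline_accuracy)  # 0.95
```

The baseline scores 95% accuracy without detecting a single spam message, which is why accuracy alone can be misleading on unbalanced data.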

Another example for a classification problem:

  • Imagine you work in a hospital
    : Predict whether a CT scan shows tumour or not
    • Tumours are rare events, so the classes are unbalanced
      : The cost of missing a tumour is much higher than a 'false alarm'
  • Accuracy is not a good metric

In this case, the confusion matrix can be used to compare the predicted values with actual values (ground truth):

  Outcome               Predicted   Actual
  True Positive (TP)    Positive    Positive
  False Positive (FP)   Positive    Negative
  False Negative (FN)   Negative    Positive
  True Negative (TN)    Negative    Negative

 

  Confusion matrix      Actual Positive   Actual Negative
  Predicted Positive    TP                FP
  Predicted Negative    FN                TN

 

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Recall = TP / (TP + FN)
    : Recall is the proportion of actual positive values that are predicted positive
  • Precision = TP / (TP + FP)
    : Precision is the proportion of predicted positive values that are actually positive
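The three formulas can be computed directly from the four confusion-matrix counts. The sketch below uses made-up labels in which 1 stands for the positive class (e.g. tumour); the helper names and the choice to return 0.0 for an empty denominator are assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def safe_divide(numerator, denominator):
    """Avoid division by zero when a class is never predicted or never present."""
    return numerator / denominator if denominator else 0.0

# Hypothetical ground truth and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = safe_divide(tp + tn, tp + tn + fp + fn)
recall = safe_divide(tp, tp + fn)
precision = safe_divide(tp, tp + fp)

print(tp, fp, fn, tn)  # 1 1 2 6
print(accuracy)        # 0.7
print(precision)       # 0.5
```

Here the accuracy is 0.7 although only one of the three actual positives was found (recall of one third), which is the situation where accuracy hides costly false negatives.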

 
