Words with similar contexts have similar meanings

  • Zellig Harris (1954): 'If A and B have almost identical environments, we say that they are synonyms' (e.g. doctor|surgeon, which share contexts such as patient, hospital, treatment, etc.)
    → This notion is referred to as the distributional hypothesis

Distributional hypothesis

  • Is concerned with the link between similar distributions and similar meanings
  • Assumes words with similar contexts have similar meanings

 

Distributional models are based on a co-occurrence matrix

  • Term-document matrix
  • Term-term matrix

Term-document matrix: the overall matrix is |V| by |D|

  • Similar documents have similar words: represented by the column vectors
  • Similar words occur in similar documents: represented by the row vectors

Term-term matrix: the overall matrix is |V| by |V|

  • Represents co-occurrences in some corpus
  • Context is usually restricted to a fixed window (e.g. +/- 4 words)
  • Similar terms have similar vectors (see the counting sketch below)
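
A minimal counting sketch of such a term-term matrix, using a toy tokenised corpus and a +/- 4-word window (the corpus and helper name are made up for illustration):

from collections import defaultdict

def cooccurrence_counts(sentences, window=4):
    # Count how often each pair of terms co-occurs within +/- `window` words
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

corpus = [
    "the food is bad".split(),
    "the meal was awful".split(),
]
print(dict(cooccurrence_counts(corpus)["food"]))   # {'the': 1, 'is': 1, 'bad': 1}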

Problems with term-term matrices:

  • Term-term matrices are sparse
    • Term vectors are long (length |V|)
    • Most entries are zero

Don't reflect the underlying linguistic structure: 'food is bad' and 'meal was awful' express much the same meaning, yet they share no words, so their count-based representations look unrelated

 

Word embeddings

  • Represent words as low-dimensional vectors
  • Capture semantic and syntactic similarities between words (e.g. food|meal, bad|awful, etc)
  • Typically have 50-300 dimensions, far fewer than the vocabulary size |V|
  • Most vector elements are non-zero

Benefits

  • Classifiers need to learn far fewer weights
    → significantly reduces the number of parameters a classifier has to estimate
  • Improve generalisation and prevent overfitting
  • Capture semantic relationships like synonymy

Word2Vec software package: produces static embeddings (unlike the contextual embeddings of BERT or ELMo)

  • Key idea
    • Predict rather than count
    • Binary prediction task: 'Is word x likely to co-occur with word y?'
    • Use the learned classifier weights
    • Running text is the training data
  • Basic algorithm (skip-gram with negative sampling)
    • Treat neighbouring context words as positive samples
    • Treat other random words in the vocabulary (V) as negative samples
      → negative samples are randomly selected words that did not appear in the context window
    • Train a logistic regression classifier to distinguish these classes
    • Use the learned weights as the embeddings (see the training sketch below)
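
As a concrete illustration, skip-gram with negative sampling can be trained with the gensim package; this is a minimal sketch assuming gensim (4.x API) is installed, with a made-up toy corpus:

from gensim.models import Word2Vec

# Tiny illustrative corpus: a list of tokenised sentences
sentences = [
    "the food was bad".split(),
    "the meal was awful".split(),
    "the food was awful".split(),
]

model = Word2Vec(
    sentences,
    vector_size=50,   # embedding dimensionality (typically 50-300)
    window=4,         # +/- 4-word context window
    sg=1,             # skip-gram (0 would select CBOW)
    negative=5,       # number of negative samples per positive pair
    min_count=1,      # keep every word in this toy corpus
)

vec = model.wv["food"]                # 50-dimensional dense vector
print(model.wv.most_similar("food"))  # cosine-nearest neighbours (only meaningful with a real corpus)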

The benefits of using word embeddings compared to traditional vector representations:

  1. They often offer better generalisation capabilities
  2. They are more compact

 


NLP tasks:

  1. Classification tasks (e.g. spam detection)
  2. Sequence tasks (e.g. text generation)
  3. Meaning tasks

A lexical database:

  • Nodes are synsets
  • Correspond to abstract concepts
  • Polyhierarchical structure
    • A polyhierarchical structure is one that allows multiple parents.

According to WordNet, a large lexical database of English, a synset (synonym set) is defined as a set of one or more synonyms that are interchangeable in some context without changing the truth value of the proposition in which they are embedded.

 

Using WordNet, we can programmatically:

  • Identify hyponyms (child terms) and hypernyms (parent terms)
  • Measure semantic similarity (see the sketch below)
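
A minimal sketch with NLTK's WordNet interface (assuming nltk is installed and the wordnet data has been downloaded):

from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")
cat = wn.synset("cat.n.01")

print(dog.hypernyms())            # parent synsets (e.g. canine.n.02)
print(dog.hyponyms()[:3])         # a few child synsets
print(dog.path_similarity(cat))   # similarity based on the shortest path in the hypernym hierarchy
print(wn.synsets("dog"))          # all synsets that contain the word 'dog'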

The process of classifying words into their parts of speech and labelling them accordingly is known as parts-of-speech tagging, POS-tagging, or simply tagging. Parts-of-speech are also known as word classes or lexical categories.

  • A POS-tagger processes a sequence of words and attaches a part-of-speech tag to each word.
  • The collection of tags used for a particular task is known as a tagset.
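
A minimal tagging sketch using NLTK's default tagger, which outputs Penn Treebank tags (it assumes nltk plus its tokenizer and tagger data are installed):

import nltk

tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
print(nltk.pos_tag(tokens))
# [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ...]
# 'refuse' and 'permit' receive different tags depending on their role in the sentence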

 


Models that assign probabilities to upcoming words, or sequences of words in general, are called language models (LMs).

  • It turns out that the large language models that revolutionised modern NLP are trained simply by predicting upcoming words.

Why does it matter what the probability of a sentence is or how probable the next word is?

  • In many NLP applications, we can use the probability as a way to choose a better sentence or word over a less-appropriate one.

Language models can also help in augmentative and alternative communication (AAC) systems
→ augmentative means to add to someone's speech
→ alternative means to be used instead of speech

  • People often use such AAC devices if they are physically unable to speak or sign but can instead use eye gaze or other specific movements to select words from a menu.
  • Word prediction can be used to suggest likely words for the menu.

N-gram language model

  • An n-gram is a sequence of n words:
    • A 2-gram (bigram) is a two-word sequence of words like "please turn", "turn your", or "your homework".
    • A 3-gram (trigram) is a three-word sequence of words like "please turn your", or "turn your homework".
  • The word 'n-gram' is also used to mean a probabilistic model that can estimate the probability of a word given the n-1 previous words, and thereby also to assign probabilities to entire sequences.

Suppose the history h is "its water is so transparent that" and we want to know the probability that the next word is "the":

P(the | its water is so transparent that)

One way to estimate this probability is from relative frequency counts:

  • Take a very large corpus, count the number of times we see "its water is so transparent that", and count the number of times this is followed by "the".

Out of the times we saw the history h, how many times was it followed by the word w? We estimate this as follows:

P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
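
A sketch of this count-and-divide estimate on a made-up toy corpus (with a real corpus the counts would of course be far larger):

corpus = ("its water is so transparent that the fish are visible "
          "its water is so transparent that you can see the bottom").split()

history = "its water is so transparent that".split()
word = "the"

n = len(history)
history_count = continuation_count = 0
for i in range(len(corpus) - n):
    if corpus[i:i + n] == history:
        history_count += 1
        if corpus[i + n] == word:
            continuation_count += 1

print(continuation_count / history_count)   # 0.5 in this toy corpus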

 

To represent the probability of a particular random variable

$$ X_i $$

taking on the value "the" or

$$ P(X_i="the") $$

we will use the simplification

$$ P(the) $$

To compute the probability of an entire sequence like

$$ P(X_1,X_2,...,X_n) $$

One thing we can do is decompose this probability using the chain rule of probability:

$$ P(X_1...X_n)=P(X_1)P(X_2|X_1)P(X_3|X_{1:2})...P(X_n|X_{1:n-1}) $$

$$ = \prod_{k=1}^{n}P(X_k|X_{1:k-1}) $$

Applying the chain rule to words, we get:

$$ P(w_{1:n})=P(w_1)P(w_2|w_1)P(w_3|w_{1:2})...P(w_n|w_{1:n-1}) $$

$$ =\prod_{k=1}^{n}P(w_k|w_{1:k-1}) $$

The chain rule shows the link between computing the joint probability of a sequence and computing the conditional probability of a word given previous words.

  • We could estimate the joint probability of an entire sequence of words by multiplying together a number of conditional probabilities.

We can't just estimate these probabilities by counting the number of times every word occurs following every long string, because language is creative and any particular context might have never occurred before.

 

The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words.

  • For example, the bigram model approximates the probability of a word given all the previous words by using only the conditional probability given the preceding word.
    • Instead of computing the probability P(the | Walden Pond's water is so transparent that),
      we approximate it with the probability P(the | that)
  • When we use a bigram model to predict the conditional probability of the next word, we are thus making the following approximation:

$$ P(w_n|w_{1:n-1})\approx P(w_n|w_{n-1}) $$

 

The assumption that the probability of a word depends only on the previous word is called a Markov assumption.

  • Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.

A general equation for an n-gram approximation to the conditional probability of the next word in a sequence:

$$ P(w_n|w_{1:n-1})\approx P(w_n|w_{n-N+1:n-1}) $$

※ N means the n-gram size (N=2 means bigrams and N=3 means trigrams)

 

An intuitive way to estimate probabilities is called maximum likelihood estimation (MLE).

  • We get the MLE estimate for the parameters of an n-gram model by getting counts from a corpus, and normalising the counts so that they lie between 0 and 1.

To compute a particular bigram probability of a word

$$ w_n $$

given a previous word

$$ w_{n-1} $$

we'll compute the count of the bigram

$$ C(w_{n-1}w_n) $$

and normalise by the sum of all the bigrams that share the same first word

$$ w_{n-1} $$

The computation is:

$$ P(w_n|w_{n-1})=\frac{C(w_{n-1}w_n)}{\sum _wC(w_{n-1}w)} $$

The sum of all bigram counts that start with a given word

$$ w_{n-1} $$

must be equal to the unigram count for that word, so the simplified equation is as follows:

$$ P(w_n|w_{n-1})=\frac{C(w_{n-1}w_n)}{C(w_{n-1})} $$
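
A minimal MLE bigram estimator following this equation; the toy corpus and the <s>/</s> sentence-boundary markers are illustrative assumptions:

from collections import Counter, defaultdict

sentences = [
    "<s> I am Sam </s>".split(),
    "<s> Sam I am </s>".split(),
    "<s> I do not like green eggs and ham </s>".split(),
]

unigram_counts = Counter()
bigram_counts = defaultdict(Counter)
for sent in sentences:
    unigram_counts.update(sent)
    for w_prev, w in zip(sent, sent[1:]):
        bigram_counts[w_prev][w] += 1

def p_bigram(w, w_prev):
    # P(w | w_prev) = C(w_prev w) / C(w_prev)
    return bigram_counts[w_prev][w] / unigram_counts[w_prev]

print(p_bigram("I", "<s>"))    # 2/3
print(p_bigram("Sam", "am"))   # 1/2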

 

Language model probabilities are always represented and computed in log format as log probabilities.

  • Since probabilities are less than or equal to 1, the more probabilities we multiply together, the smaller the product becomes.

By using log probabilities instead of raw probabilities, we get numbers that are not as small.

  • Adding in log space is equivalent to multiplying in linear space, so we combine log probabilities by adding them.

$$ P_1\times P_2\times P_3\times P_4=\exp(\log P_1+\log P_2+\log P_3+\log P_4) $$

The result of doing all computation and storage in log space is that we only need to convert back into probabilities if we need to report them at the end; then we can just take the exp of the logprob!
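
A quick numeric check that adding log probabilities and exponentiating at the end recovers the product of the raw probabilities:

import math

probs = [0.1, 0.05, 0.2, 0.3]

product = 1.0
for p in probs:
    product *= p

log_sum = sum(math.log(p) for p in probs)

print(product)            # ~0.0003
print(math.exp(log_sum))  # the same value, recovered from log space (up to floating-point error)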

 

Evaluating language models

  • Extrinsic evaluation is the end-to-end evaluation (often very expensive!)
    → extrinsic evaluation (or task-based evaluation) captures how useful the model is in a particular task
  • Intrinsic evaluation measures the quality of a model independent of any application
    → intrinsic evaluation captures how well the model captures what it is supposed to capture, like probabilities
    • Perplexity (sometimes abbreviated as PP or PPL): the standard intrinsic metric for measuring language model performance, both for simple n-gram language models and for the more sophisticated neural large language models.

In order to evaluate any machine learning model, at least three distinct data sets are needed:

  1. Training set is used to learn the parameters of the model
  2. Development test set (or devset) is used to see how well the model is doing during development (e.g. for tuning)
  3. Test set is used to evaluate the model → held-out set of data, not overlapping with the training set

Given two probabilistic models, the better model is the one that has a tighter fit to the test data or that better predicts the details of the test data, and hence will assign a higher probability to the test data.


The perplexity of a language model on a test set is the inverse probability of the test set (one over the probability of the test set), normalised by the number of words. For this reason, it's sometimes called the per-word perplexity.

For a test set

$$ W=w_1w_2...w_N $$

$$ perplexity(W)=P(w_1w_2...w_N)^{-\frac{1}{N}} $$

$$ =\sqrt[N]{\frac{1}{P(w_1w_2...w_N)}} $$

Or we can use the chain rule to expand the probability of W:

$$ \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i|w_1...w_{i-1})}} $$

  • Because of the inverse, the higher the probability of the word sequence, the lower the perplexity.
    → Thus the lower the perplexity of a model on the data, the better the model, and minimising perplexity is equivalent to maximising the test set probability according to the language model.
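
A sketch of per-word perplexity for a bigram model, computed in log space; it assumes a p_bigram estimator like the one sketched earlier and, as one common convention, counts only the N predicted tokens after <s>:

import math

def perplexity(test_words, p_bigram):
    # perplexity = exp(-(1/N) * sum of log P(w_i | w_{i-1}))
    n = len(test_words) - 1   # number of predicted words (everything after <s>)
    log_prob = sum(math.log(p_bigram(w, w_prev))
                   for w_prev, w in zip(test_words, test_words[1:]))
    return math.exp(-log_prob / n)

# e.g. with the toy bigram model sketched earlier:
# perplexity("<s> I am Sam </s>".split(), p_bigram)   # ~1.73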

Perplexity can also be thought of as the weighted average branching factor of a language.

  • The branching factor of a language is the number of possible next words that can follow any word.

If we have an artificial deterministic language of integer numbers whose vocabulary consists of the 10 digits, in which any digit can follow any other digit, then the branching factor of that language is 10.

 

Suppose that each of the 10 digits occurs with exactly equal probability:

$$ P=\frac{1}{10} $$

Imagine a test string of digits of length N, and again, assume that in the training set all the digits occurred with equal probability. The perplexity will be

$$ perplexity(W)=P(w_1w_2...w_N)^{-\frac{1}{N}}=\left(\left(\frac{1}{10}\right)^N\right)^{-\frac{1}{N}}=\left(\frac{1}{10}\right)^{-1}=10 $$
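
A one-line numeric check of this result (for any test length N, the per-word perplexity of the uniform 10-digit model comes out as 10):

N = 8
prob_of_test_string = (1 / 10) ** N
print(prob_of_test_string ** (-1 / N))   # 10.0 (up to floating-point rounding)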


A regular expression (often shortened to regex) is a formal language for defining text strings (character sequences).

  • It is used for pattern matching (e.g. searching and replacing in text)

Formally, a regular expression is an algebraic notation for characterising a set of strings.

  • Regular expressions are particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search through.

The simplest kind of regular expression is a sequence of simple characters; putting characters in sequence is called concatenation.

 

Regular expressions are case sensitive; lower case /s/ is distinct from upper case /S/.

  • /s/ matches a lower case s but not an upper case S.
  • We can solve this problem with the use of the square braces [ and ].
    → The string of characters inside the braces specifies a disjunction of characters to match.
    → e.g. the pattern /[sS]/ matches patterns containing either s or S.
    → e.g. the pattern /cat|dog/ matches either the string cat or the string dog (the pipe symbol, |, expresses disjunction; note it is written outside square brackets, since brackets only disjoin single characters)

 

Regular expressions provide a very flexible notation for these kinds of patterns and transformations:

  1. Disjunctions
    : e.g. r"[Ss]et", r"Set|set", r"[S-s]et" → Find both "Set" and "set" (note that the ASCII range S-s also covers other characters, so [Ss] is the safer form)
  2. Negation
    : e.g. r"[^0-9]" → Find characters excluding the numbers from 0 to 9
  3. Optionality
    : e.g. r"beg.n" # . is the wildcard, matching any single character
              → substituting matches in "begin began begun beginning" with "X" gives "X X X Xning"
    : e.g. r"colou?r" # ? means the previous character is optional
    : e.g. r"\w.*" # * is the Kleene star, meaning match 0 or more of the previous char
              → Greedy matching: it finds a word character (\w) and then grabs everything (.*) after it,
                   essentially matching (and, in a substitution, deleting) the rest of the line.
    : e.g. r"\w.*?" # make the match non-greedy using the ? character
    : e.g. r"fooo+" # + is the Kleene plus, meaning match 1 or more of the previous char
    → Patterns ending with * always match the largest string they can; patterns are greedy, expanding to cover as much of a string as they can. However, there are ways to enforce non-greedy matching, using another meaning of the ? qualifier: the *? and +? operators are a Kleene star/plus that matches as little text as possible.
  4. Aliases
    - \d: any digit → [0-9]
    - \D: any non-digit → [^0-9]
    - \w: any alphanumeric/underscore → [a-zA-Z0-9_]
    - \W: a non-alphanumeric → [^\w]
    - \s: whitespace (space, tab) → [ \r\t\n\f]
    - \S: non-whitespace → [^\s]
  5. Anchors
    : e.g. re.sub(r'\w+', "", text) # delete all words
    : e.g. re.sub(r'^\w+', "", text, flags=re.MULTILINE)
              # delete only words at the start of a string
              # switch on multiline mode to also delete words at the start of each line

    : e.g. re.sub(r'\W$', "", text, flags=re.MULTILINE) # use $ to anchor the match at the end of a string (or line, in multiline mode)
    → a runnable version of these substitutions follows below
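
A runnable version of the substitutions above; the sample text is made up for illustration:

import re

text = "first line of words\nsecond line here"

print(re.sub(r"\w+", "", text))
# all words deleted; only the spaces and the newline remain

print(re.sub(r"^\w+", "", text, flags=re.MULTILINE))
# 'first' and 'second' deleted: ^ anchors at the start of each line in MULTILINE mode

print(re.sub(r"\W$", "", "ends with punctuation!", flags=re.MULTILINE))
# trailing '!' deleted: $ anchors the match at the end of the string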

Operator precedence

  1. Parenthesis: ()
  2. Counters: * + ? {}
  3. Sequences and anchors: the ^my end$
  4. Disjunction: |

Thus,

  • Because counters have a higher precedence than sequences, /the*/ matches theeee but not thethe.
  • Because sequences have a higher precedence than disjunction, /the|any/ matches the or any but not thany or theny.
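
A small check of both precedence claims, using full-string matches so that partial matches inside the string don't confuse the result:

import re

print(bool(re.fullmatch(r"the*", "theeee")))    # True: * applies to 'e' only
print(bool(re.fullmatch(r"the*", "thethe")))    # False: the pattern is th + e*, not (the)*

print(bool(re.fullmatch(r"the|any", "any")))    # True
print(bool(re.fullmatch(r"the|any", "theny")))  # False: | disjoins the whole sequences 'the' and 'any'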

 

Stop words

  • High-frequency words, little lexical content (e.g. and, the, of)
  • When used in certain ML tasks they can add 'noise' (but not always)
  • Filter out beforehand
  • Language specific, no universal list

Text corpora

  1. Gutenberg corpus (from nltk.corpus import gutenberg)
    : Electronic text archive, containing free electronic books → Isolated
  2. Web & chat (from nltk.corpus import nps_chat)
    : Samples of less formal language
  3. Brown corpus (from nltk.corpus import brown)
    : Text from 500 sources, categorised by genre → Categorised
  4. Reuters corpus (from nltk.corpus import reuters)
    : 10,788 news docs, tagged with various topics/industries etc. → Overlapping
  5. Inaugural address corpus (from nltk.corpus import inaugural)
    : 55 texts, one for each US presidential inaugural address → Temporal
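
A minimal sketch of loading these corpora with NLTK (each one assumes the corresponding nltk data package has already been downloaded):

from nltk.corpus import gutenberg, brown, reuters, inaugural

print(gutenberg.fileids()[:3])              # e.g. ['austen-emma.txt', ...]
print(brown.categories()[:5])               # genre labels such as 'news', 'romance'
print(brown.words(categories="news")[:10])  # tokens from the 'news' category
print(reuters.categories()[:5])             # overlapping topic labels
print(inaugural.fileids()[:3])              # e.g. ['1789-Washington.txt', ...]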

 

Edit distance is a metric that measures how similar two strings are based on the number of edits (insertions, deletions, substitutions) it takes to change one string into the other.

  • Edit distance is an algorithm with applications throughout language processing, from spelling correction to speech recognition to coreference resolution.
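
A minimal dynamic-programming sketch of (Levenshtein) edit distance, with insertions, deletions, and substitutions all costing 1:

def edit_distance(source, target):
    n, m = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i              # delete all of source[:i]
    for j in range(m + 1):
        dist[0][j] = j              # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
                dist[i - 1][j - 1] + sub_cost,  # substitution (or copy)
            )
    return dist[n][m]

print(edit_distance("intention", "execution"))   # 5 with unit substitution cost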

 
