NLP tasks:

  1. Classification tasks (e.g. spam detection)
  2. Sequence tasks (e.g. text generation)
  3. Meaning tasks

A lexical database:

  • Nodes are synsets
  • Correspond to abstract concepts
  • Polyhierarchical structure
    • A polyhierarchical structure is one that allows multiple parents.

WordNet is a large lexical database of English. It defines a synset (synonym set) as a set of one or more synonyms that are interchangeable in some context without changing the truth value of the proposition in which they are embedded.

 

Using WordNet, we can programmatically:

  • Identify hyponyms (child terms) and hypernyms (parent terms)
  • Measure semantic similarity
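
The ideas above can be sketched without WordNet itself. The following is a minimal illustration on a hand-built toy graph: the concepts, the `HYPERNYMS` table and the function names are all assumptions made for this example, and the score 1 / (1 + shortest path length) is just one simple path-based similarity measure.

```python
from collections import deque

# Toy polyhierarchy (hypothetical data): each concept maps to its parent concepts.
# "dog" and "cat" each have two parents, which makes the structure polyhierarchical.
HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "cat": ["feline", "domestic_animal"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
    "carnivore": ["animal"],
    "domestic_animal": ["animal"],
    "animal": [],
}


def hypernyms(concept):
    """Return the parent terms of a concept."""
    return list(HYPERNYMS[concept])


def hyponyms(concept):
    """Return the child terms of a concept, sorted alphabetically."""
    return sorted(c for c, parents in HYPERNYMS.items() if concept in parents)


def shortest_path_length(a, b):
    """Breadth-first search over hypernym/hyponym links, treated as undirected."""
    neighbours = {c: set(parents) for c, parents in HYPERNYMS.items()}
    for child, parents in HYPERNYMS.items():
        for parent in parents:
            neighbours[parent].add(child)
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no connecting path


def path_similarity(a, b):
    """Simple path-based similarity: 1 / (1 + shortest path length)."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1.0 / (1 + dist)


print(hypernyms("dog"))
print(hyponyms("carnivore"))
print(path_similarity("dog", "cat"))
```

In this toy graph "dog" and "cat" are two links apart (via "domestic_animal"), so they score higher than pairs that only meet at "animal".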

The process of classifying words into their parts of speech and labelling them accordingly is known as parts-of-speech tagging, POS-tagging, or simply tagging. Parts-of-speech are also known as word classes or lexical categories.

  • A POS tagger processes a sequence of words and attaches a part-of-speech tag to each word.
  • The collection of tags used for a particular task is known as a tagset.
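
To make the tagger and tagset ideas concrete, here is a deliberately naive lookup tagger. The lexicon, the four-tag tagset and the NOUN default are all assumptions for illustration; real taggers also use the surrounding context to resolve ambiguous words.

```python
# Hypothetical mini-lexicon mapping lowercase words to tags.
LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "cat": "NOUN", "homework": "NOUN",
    "chased": "VERB", "sees": "VERB",
    "small": "ADJ",
}
DEFAULT_TAG = "NOUN"  # assumption: unknown words are tagged as nouns

# The tagset is simply the collection of tags this tagger can emit.
TAGSET = sorted(set(LEXICON.values()) | {DEFAULT_TAG})


def pos_tag(words):
    """Attach a tag to each word by dictionary lookup (no context used)."""
    return [(word, LEXICON.get(word.lower(), DEFAULT_TAG)) for word in words]


tagged = pos_tag("The small dog chased a cat".split())
print(TAGSET)
print(tagged)
```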

 


Models that assign probabilities to upcoming words, or sequences of words in general, are called language models (LMs).

  • It turns out that the large language models that revolutionised modern NLP are trained just by predicting upcoming words.

Why does it matter what the probability of a sentence is or how probable the next word is?

  • In many NLP applications, we can use the probability as a way to choose a better sentence or word over a less-appropriate one.

Language models can also help in augmentative and alternative communication (AAC) systems.

  • People often use such AAC devices if they are physically unable to speak or sign but can instead use eye gaze or other specific movements to select words from a menu.
  • Word prediction can be used to suggest likely words for the menu.

N-gram language model

  • An n-gram is a sequence of n words:
    • A 2-gram (bigram) is a two-word sequence like "please turn", "turn your", or "your homework".
    • A 3-gram (trigram) is a three-word sequence like "please turn your" or "turn your homework".
  • The word 'n-gram' is also used to mean a probabilistic model that can estimate the probability of a word given the n-1 previous words, and thereby also to assign probabilities to entire sequences.
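
As a small sketch of the first sense of the term (an n-gram as a sequence of n words), the helper below slides a window over a token list. The function name `ngrams` and the whitespace tokenisation are assumptions for this example.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "please turn your homework".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```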

Suppose the history h is "its water is so transparent that" and we want to know the probability that the next word is "the":

P(the | its water is so transparent that)

One way to estimate this probability is from relative frequency counts:

  • Take a very large corpus, count the number of times we see "its water is so transparent that", and count the number of times this is followed by "the".

This answers the question "out of the times we saw the history h, how many times was it followed by the word w?":

P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
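
The counting estimate can be sketched on a tiny made-up corpus. The two-sentence corpus and the helper names below are assumptions for illustration; a real estimate would need a very large corpus.

```python
def count_sequence(tokens, seq):
    """Count how many times the token list `seq` occurs contiguously in `tokens`."""
    n = len(seq)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == seq)


def relative_frequency(tokens, history, word):
    """Estimate P(word | history) as C(history word) / C(history)."""
    history_count = count_sequence(tokens, history)
    if history_count == 0:
        return None  # the history was never seen, so the estimate is undefined
    return count_sequence(tokens, history + [word]) / history_count


# Hypothetical toy corpus containing the history twice.
corpus = (
    "its water is so transparent that the fish are visible . "
    "its water is so transparent that you can see the bottom ."
).split()
history = "its water is so transparent that".split()

p_the = relative_frequency(corpus, history, "the")
print(p_the)
```

Here the history occurs twice and is followed by "the" once, so the estimate is 1/2. A history that never occurs gives no estimate at all.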

 

To represent the probability of a particular random variable

$$ X_i $$

taking on the value "the" or

$$ P(X_i="the") $$

we will use the simplification

$$ P(the) $$

To compute probabilities of entire sequences like

$$ P(X_1...X_n) $$

one thing we can do is decompose this probability using the chain rule of probability:

$$ P(X_1...X_n)=P(X_1)P(X_2|X_1)P(X_3|X_{1:2})...P(X_n|X_{1:n-1}) $$

$$ = \prod_{k=1}^{n}P(X_k|X_{1:k-1}) $$

Applying the chain rule to words, we get:

$$ P(w_{1:n})=P(w_1)P(w_2|w_1)P(w_3|w_{1:2})...P(w_n|w_{1:n-1}) $$

$$ =\prod_{k=1}^{n}P(w_k|w_{1:k-1}) $$

The chain rule shows the link between computing the joint probability of a sequence and computing the conditional probability of a word given previous words.

  • We could estimate the joint probability of an entire sequence of words by multiplying together a number of conditional probabilities.
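
As a small worked instance of this expansion, take the first four words of the example history (an illustrative choice):

$$ P(\text{its water is so})=P(\text{its})P(\text{water}|\text{its})P(\text{is}|\text{its water})P(\text{so}|\text{its water is}) $$

Each factor conditions on a longer prefix than the one before it, so the later factors are the hardest to estimate from counts.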

We can't just estimate by counting the number of times every word occurs following every long string, because language is creative and any particular context might have never occurred before.

 

The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words.

  • For example, the bigram model approximates the probability of a word given all the previous words by using only the conditional probability given the preceding word.
    • Instead of computing the probability P(the | Walden Pond's water is so transparent that),
      we approximate it with the probability P(the | that)
  • When we use a bigram model to predict the conditional probability of the next word, we are thus making the following approximation:

$$ P(w_n|w_{1:n-1})\approx P(w_n|w_{n-1}) $$

 

The assumption that the probability of a word depends only on the previous word is called a Markov assumption.

  • Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.

A general equation for an n-gram approximation to the conditional probability of the next word in a sequence:

$$ P(w_n|w_{1:n-1})\approx P(w_n|w_{n-N+1:n-1}) $$

※ N means the n-gram size (N=2 means bigrams and N=3 means trigrams)
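
In code, the approximation amounts to truncating the history to its last N-1 words before looking anything up. The helper name `ngram_context` is an assumption for this sketch.

```python
def ngram_context(history, N):
    """Keep only the last N-1 words of the history, as an N-gram model does."""
    if N < 1:
        raise ValueError("N must be at least 1")
    k = N - 1
    # For N=1 (unigrams) the context is empty; slicing with -0 would keep everything.
    return history[-k:] if k > 0 else []


history = "Walden Pond's water is so transparent that".split()
print(ngram_context(history, 2))  # bigram context
print(ngram_context(history, 3))  # trigram context
```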

 

An intuitive way to estimate probabilities is called maximum likelihood estimation (MLE).

  • We get the MLE estimate for the parameters of an n-gram model by getting counts from a corpus, and normalising the counts so that they lie between 0 and 1.

To compute a particular bigram probability of a word

$$ w_n $$

given a previous word

$$ w_{n-1} $$

we'll compute the count of the bigram

$$ C(w_{n-1}w_n) $$

and normalise by the sum of all the bigrams that share the same first word

$$ w_{n-1} $$

The computation is:

$$ P(w_n|w_{n-1})=\frac{C(w_{n-1}w_n)}{\sum _wC(w_{n-1}w)} $$

The sum of all bigram counts that start with a given word

$$ w_{n-1} $$

must be equal to the unigram count for that word, so the simplified equation is as follows:

$$ P(w_n|w_{n-1})=\frac{C(w_{n-1}w_n)}{C(w_{n-1})} $$
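
The simplified equation can be sketched directly with two counters. The three-sentence toy corpus, the `<s>` and `</s>` boundary markers and the function names are assumptions for this example, and no smoothing is applied.

```python
from collections import Counter


def train_bigram_mle(sentences):
    """Collect unigram and bigram counts, padding each sentence with <s> and </s>."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    return unigram_counts, bigram_counts


def bigram_prob(prev, word, unigram_counts, bigram_counts):
    """MLE estimate P(word | prev) = C(prev word) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0  # assumption: an unseen context gets probability 0 (no smoothing)
    return bigram_counts[(prev, word)] / unigram_counts[prev]


# Hypothetical toy corpus.
corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
unigrams, bigrams = train_bigram_mle(corpus)

print(bigram_prob("<s>", "I", unigrams, bigrams))
print(bigram_prob("I", "am", unigrams, bigrams))
print(bigram_prob("am", "Sam", unigrams, bigrams))
```

For instance, "I" occurs three times and is followed by "am" twice, so P(am | I) = 2/3.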

 

Language model probabilities are always represented and computed in log format as log probabilities.

  • Since probabilities are less than or equal to 1, the more probabilities we multiply together, the smaller the product becomes; multiplying enough of them together results in numerical underflow.

By using log probabilities instead of raw probabilities, we get numbers that are not as small.

  • Adding in log space is equivalent to multiplying in linear space, so we combine log probabilities by adding them.

$$ P_1\times P_2\times P_3\times P_4=\exp(\log{P_1}+\log{P_2}+\log{P_3}+\log{P_4}) $$

The result of doing all computation and storage in log space is that we only need to convert back into probabilities if we need to report them at the end; then we can just take the exp of the logprob!
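
A quick sketch of why this matters in practice. The probability values below are made up for illustration; the point is only that a long raw product underflows in double precision while the sum of logs stays usable.

```python
import math

# Hypothetical per-word probabilities for a 400-word sequence.
probs = [0.1] * 400

raw_product = math.prod(probs)               # too small for a double: becomes 0.0
log_prob = sum(math.log(p) for p in probs)   # a moderate negative number

print(raw_product)
print(log_prob)

# For a short sequence, exp of the summed logs recovers the ordinary product.
short = [0.5, 0.25, 0.1]
recovered = math.exp(sum(math.log(p) for p in short))
print(recovered)
```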

 

Evaluating language models

  • Extrinsic evaluation is the end-to-end evaluation (often very expensive!)
  • Intrinsic evaluation measures the quality of a model independent of any application
    • Perplexity (sometimes abbreviated as PP or PPL): the standard intrinsic metric for measuring language model performance, both for simple n-gram language models and for the more sophisticated neural large language models.

In order to evaluate any machine learning model, at least three distinct data sets are needed:

  1. Training set is used to learn the parameters of the model
  2. Development test set (or devset) is used to see how good the model is during development, without touching the test set
  3. Test set is used to evaluate the model → held-out set of data, not overlapping with the training set

Given two probabilistic models, the better model is the one that has a tighter fit to the test data or that better predicts the details of the test data, and hence will assign a higher probability to the test data.


The perplexity of a language model on a test set is the inverse probability of the test set (one over the probability of the test set), normalised by the number of words. For this reason, it's sometimes called the per-word perplexity.

For a test set

$$ W=w_1w_2...w_N $$

$$ perplexity(W)=P(w_1w_2...w_N)^{-\frac{1}{N}} $$

$$ =\sqrt[N]{\frac{1}{P(w_1w_2...w_N)}} $$

Or we can use the chain rule to expand the probability of W:

$$ perplexity(W)=\sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i|w_1...w_{i-1})}} $$

  • Because of the inverse, the higher the probability of the word sequence, the lower the perplexity.
    → Thus the lower the perplexity of a model on the data, the better the model, and minimising perplexity is equivalent to maximising the test set probability according to the language model.

Perplexity can also be thought of as the weighted average branching factor of a language.

  • The branching factor of a language is the number of possible next words that can follow any word.

If we have an artificial deterministic language of integer numbers whose vocabulary consists of the 10 digits, in which any digit can follow any other digit, then the branching factor of that language is 10.

 

Suppose that each of the 10 digits occurs with exactly equal probability:

$$ P=\frac{1}{10} $$

Imagine a test string of digits of length N, and again, assume that in the training set all the digits occurred with equal probability. The perplexity will be

$$ perplexity(W)=P(w_1w_2...w_N)^{-\frac{1}{N}}=\left(\left(\frac{1}{10}\right)^N\right)^{-\frac{1}{N}}=\left(\frac{1}{10}\right)^{-1}=10 $$
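
The digit example can be checked numerically. The sketch below computes perplexity in log space from a list of per-word conditional probabilities; the function name and the choice of a 25-digit test string are assumptions for illustration.

```python
import math


def perplexity(word_probs):
    """Perplexity from per-word probabilities P(w_i | w_1 ... w_{i-1}).

    Uses exp(-(1/N) * sum(log p_i)), which equals P(W)^(-1/N)
    but avoids underflow on long test sets.
    """
    n = len(word_probs)
    if n == 0:
        raise ValueError("need at least one word")
    avg_neg_log = -sum(math.log(p) for p in word_probs) / n
    return math.exp(avg_neg_log)


# Every digit has probability 1/10, whatever the length of the test string.
digits_ppl = perplexity([1 / 10] * 25)
print(digits_ppl)
```

The result matches the branching factor of 10 for any test length, since the N in the exponent cancels the N-fold product.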


NLP tasks can be categorised by problem type:

  • Classification
    • Sentiment classification
    • News categorisation
  • Regression
    • Essay scoring
  • Sequence labelling
    • Part of speech tagging, named entity recognition

How do we evaluate the models? Here is an example of a classification problem:

  • Imagine we are building a spam classifier
    : Predict whether email messages will be filtered or not
    • Input = feature matrix (email message)
    • Output = target vector (yes/no)
  • Model could be Naive Bayes, k-nearest neighbour, etc.
  • This is a binary classification problem

In this case, the goal is to predict 'spam' or 'not spam' for email messages. As before:

  1. Choose a class of model
  2. Set model hyperparameters
  3. Configure the data (X and y)
  4. Fit the model to the data
  5. Apply model to new (unseen) data

To measure performance, we should consider several factors, including

  • Metric(s)
    • These are quantitative measures that assess how well a model performs. A common metric is accuracy, which is calculated as the number of correct predictions divided by the total number of predictions (n).
  • Balance of the dataset
    • This refers to the distribution of classes within your data. An imbalanced dataset can skew the performance metrics, so it's important to consider this factor as well (for an unbalanced dataset, we can achieve high accuracy simply by selecting the majority class).
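
The point about unbalanced data can be illustrated with a majority-class baseline. The 95/5 label split below is made up for the example.

```python
from collections import Counter

# Hypothetical unbalanced dataset: 95 legitimate messages, 5 spam.
y_true = ["not spam"] * 95 + ["spam"] * 5

# A "classifier" that ignores its input and always predicts the majority class.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
spam_found = sum(t == "spam" and p == "spam" for t, p in zip(y_true, y_pred))

print(majority_label, accuracy, spam_found)
```

Accuracy looks high even though not a single spam message is caught, which is why accuracy alone can mislead on unbalanced data.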

Another example for a classification problem:

  • Imagine you work in a hospital
    : Predict whether a CT scan shows tumour or not
    • Tumours are rare events, so the classes are unbalanced
      : The cost of missing a tumour is much higher than a 'false alarm'
  • Accuracy is not a good metric

In this case, the confusion matrix can be used to compare the predicted values with actual values (ground truth):

| Outcome             | Predicted | Actual   |
| True Positive (TP)  | Positive  | Positive |
| False Positive (FP) | Positive  | Negative |
| False Negative (FN) | Negative  | Positive |
| True Negative (TN)  | Negative  | Negative |

 

Confusion matrix (rows are predicted values, columns are actual values):

|                    | Actual Positive | Actual Negative |
| Predicted Positive | TP              | FP              |
| Predicted Negative | FN              | TN              |

 

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Recall = TP / (TP + FN)
    : Recall is the proportion of actual positive values that are predicted positive
  • Precision = TP / (TP + FP)
    : Precision is the proportion of predicted positive values that are actually positive
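
The three formulas above can be sketched as follows. The ten scan labels are invented for illustration, and returning 0.0 when a denominator is zero is a convention assumed here, not something the definitions prescribe.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Accuracy, recall and precision; 0.0 is returned for an empty denominator."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Hypothetical scan results: 1 = tumour, 0 = no tumour.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
scores = metrics(*counts)
print(counts)
print(scores)
```

In this made-up example accuracy is 0.7 while recall is only 1/3: two of the three tumours are missed, which is the costly kind of error in the hospital setting.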

 


Natural Language Processing (NLP) is informed by a number of perspectives; several disciplines contribute to it:

  • Computer/data science
    • Theoretical foundation of computation and practical techniques for implementation
  • Information science
    • Analysis, classification, manipulation, retrieval and dissemination of information
  • Computational Linguistics
    • Use of computational techniques to study linguistic phenomena
  • Cognitive science
    • Study of human information processing (perception, language, reasoning, etc.)

NLP adopts multiple paradigms:

  • Symbolic approaches
    • Rule-based, hand coded (by linguists/subject matter experts)
    • Knowledge-intensive
  • Statistical approaches
    • Distributional & neural approaches, supervised or unsupervised
    • Data-intensive

NLP applications:

  • Text categorisation
    • Media monitoring
      • Classify incoming news stories
    • Search engines
      • Classify query intent, e.g. search for 'LOG313'
    • Spam detection
  • Machine translation
    • Fully automatic, e.g. Google Translate
    • Semi-automated
      • Helping human translators
  • Text summarisation
    : to manage information overload, we need to abstract text down to its most important elements, i.e. summarise it
    • Summarisation
      • Single-document vs. multi-document
    • Search results
    • Word processing
    • Research/analysis tools
  • Dialog systems
    • Chatbots
    • Smart speakers
    • Smartphone assistants
    • Call handling systems
      • Travel
      • Hospitality
      • Banking
  • Sentiment Analysis
    : identify and extract subjective information
    • Several sub-tasks:
      • Identify polarity
        e.g. of movie reviews
        e.g. positive, negative, or neutral
      • Identify emotional states
        e.g. angry, sad, happy, etc
      • Subjectivity/objectivity identification
        e.g. distinguishing "fact" from opinion
      • Feature/aspect-based
        : differentiate between specific features or aspects of entities
  • Text mining
    • Analogy with Data Mining
      • Discover or infer new knowledge from unstructured text resources
    • A<->B and B<->C
      • Infer A<->C?
        e.g. link between migraine headaches and magnesium deficiency
    • Applications in life sciences, media/publishing, counter terrorism and competitive intelligence
  • Question answering
    • Going beyond the document retrieval paradigm
      : provide specific answers to specific questions
  • Natural language generation
  • Speech recognition & synthesis

…and lots more

 

History of NLP

  • Foundational Insights: 1940s and 1950s
    • Two foundational paradigms:
      1. The automaton, which is the essential information processing unit
      2. Probabilistic or information-theoretic models
    • The automaton arose out of Turing’s (1936) model of algorithmic computation
      • Chomsky (1956) considered finite state machines as a way to characterise a grammar
        : he was one of the first people to use these ideas
    • Shannon (1948) borrowed the concept of entropy from thermodynamics
      : Entropy is a measure of uncertainty (the higher the entropy, the greater the uncertainty)
      • As a way of measuring the information content of a language
      • Measured the entropy of English using probabilistic techniques
  • Two camps: 1960s and 1970s
    • Speech and language processing split into two paradigms:
      1. Symbolic:
           - Chomsky and others on parsing algorithms
           - Artificial intelligence (1956) work on reasoning and logic
           - Early natural language understanding (NLU) systems:
                - Single-domain pattern matching
                - Keyword search
                - Heuristics for reasoning
      2. Statistical (stochastic)
           - Mosteller and Wallace (1964) applied Bayesian methods to the problem of authorship attribution on The Federalist Papers
  • Early NLP systems
    : ELIZA and SHRDLU were highly influential early NLP systems
    • ELIZA
      • Weizenbaum 1966
      • Pattern matching (ELIZA used elementary keyword spotting techniques)
      • First chatbot
    • SHRDLU
      • Winograd 1972
      • Natural language understanding
      • Comprehensive grammar of English
        Winograd created an imaginary world called the blocks world (simulating a robot embedded in a world of toy blocks). The user could interact with the blocks world by asking questions and giving commands.
    • Further developments in the 1960s
      • First text corpora (corpora is plural of corpus)
        • The Brown corpus: a one-million-word collection of samples from 500 written texts from different genres (newspaper, novels, non-fiction, academic, etc.), assembled at Brown University in 1963-64 (Kučera and Francis, 1967; Francis, 1979; Francis and Kučera, 1982), and William S. Y. Wang’s 1967 DOC (Dictionary on Computer)
    • Empiricism: 1980s and 1990s
      : The rise of the WWW emphasised the need for language-based information retrieval and information extraction.
      • The return of two classes of models that had lost popularity:
        1. Finite-state models:
             - Finite-state morphology by Kaplan and Kay (1981) and models of syntax by Church (1980)
        2. Probabilistic and data-driven approaches:
             - From speech recognition to part-of-speech tagging, parsing and semantics
      • Model evaluation
        • Quantitative metrics, comparison of performance with previous published research
        • Regular competitive evaluation exercises such as the Message Understanding Conferences (MUC)
    • The rise of machine learning: 2000s
      : Large amounts of spoken and written language data became available, including annotated collections
      e.g. Penn Treebank (Marcus et al. 1993)
      • Traditional NLP problems, such as parsing and semantic analysis, became problems for supervised learning
      • Unsupervised statistical approaches began to receive renewed attention
        • Statistical approaches to machine translation (Brown et al., 1990; Och and Ney, 2003) and topic modelling (Blei et al., 2003) demonstrated that effective applications could be constructed from systems trained on unannotated data
        • Cost and difficulty of producing annotated corpora became a limiting factor for supervised approaches
    • Ascendance of deep learning: 2010s onwards
      • Deep learning methods have become pervasive in NLP and AI in general
        • Advances in technology such as GPUs developed for gaming
        • Plummeting costs of memory
        • Wide availability of software platforms
      • Classic ML methods require analysts to select features based on domain knowledge
        • Deep learning introduced automated feature engineering: generated by the learning system itself
      • Collobert et al. (2011) applied convolutional neural nets (CNNs) to POS tagging, chunking, NE tags and language modelling
        • CNNs are unable to handle long-distance contextual information
      • Recurrent neural networks (RNNs) process items as a sequence with a "memory" of previous inputs
        : The method is very useful for what we call sequence labelling tasks.
        • Applicable to many tasks such as:
          • Word-level: named entity recognition, language modelling
          • Sentence-level: sentiment analysis, selecting responses to messages
          • Language generation for machine translation, image captioning, etc.

RNNs are supplemented with long short-term memory (LSTM) or gated recurrent units (GRUs) to improve training performance by mitigating the 'vanishing gradient problem'.
