Text is often referred to as unstructured data.

  • Text has plenty of structure, but it is linguistic structure as it is intended for human consumption, not for computers.

 

Text may contain synonyms (multiple words with the same meaning) and homographs (one spelling shared among multiple words with different meanings).

 

People write ungrammatically, misspell words, run words together, abbreviate unpredictably, and punctuate randomly.

 

Because text is intended for communication between people, context is important.

 

The general strategy in text mining is to use the simplest technique that works.

  • A document is composed of individual tokens or terms.
  • A collection of documents is called a corpus.

 

Language is ambiguous. To determine structure, we must resolve ambiguity.

  • Processing text data (a minimal code sketch of the first four steps follows this list):
    • Lexical analysis (tokenisation)
      : Tokenisation is the task of chopping text up into pieces, called tokens.
    • Stop word removal
      : A stopword is a very common word. In English, words such as the, and, of and on are considered stopwords, so they are typically removed.
    • Stemming
      : Suffixes are removed so that verbs like announces, announced and announcing are all reduced to the term announc. Stemming also transforms noun plurals to their singular forms, so directors becomes director.
    • Lemmatisation
      : A lemma is the canonical form, dictionary form, or citation form of a set of word forms. For example, break, breaks, broke, broken and breaking are forms of the same lexeme, with break as the lemma by which they are indexed. Lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning.
    • Morphology (prefixes, suffixes, etc.)
  • The higher levels of ambiguity:
    • Syntax (part of speech tagging)
      • Ambiguity problem
    • Parsing (grammar)
    • Sentence boundary detection
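As a concrete illustration, here is a minimal sketch of the first four processing steps (tokenisation, stop word removal, stemming, lemmatisation) using the NLTK library. The example sentence is made up, and the NLTK data packages (punkt, stopwords, wordnet) are assumed to be downloadable in your environment.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the required NLTK data (only needed once; newer NLTK versions
# may also ask for additional packages such as punkt_tab)
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The company announces that its directors are breaking with tradition."

# Lexical analysis (tokenisation): chop the text into tokens
tokens = nltk.word_tokenize(text.lower())

# Stop word removal: drop very common words such as 'the', 'that', 'are'
stop_words = set(stopwords.words("english"))
content_tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

# Stemming: strip suffixes, e.g. 'announces' -> 'announc', 'directors' -> 'director'
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in content_tokens]

# Lemmatisation: map word forms to their lemma, e.g. 'breaking' -> 'break' (as a verb)
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t, pos="v") for t in content_tokens]

print(stems)
print(lemmas)
```

Note how stemming produces truncated terms such as announc, while lemmatisation maps inflected forms like breaking back to the dictionary form break.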

 


NLP tasks can be categorised by problem type:

  • Classification
    • Sentiment classification
    • News categorisation
  • Regression
    • Essay scoring
  • Sequence labelling
    • Part of speech tagging, named entity recognition

How do we evaluate the models? Here is an example of a classification problem:

  • Imagine we are building a spam classifier
    : Predict whether email messages will be filtered or not
    • Input = feature matrix (email message)
    • Output = target vector (yes/no)
  • Model could be Naive Bayes, k-nearest neighbour, etc.
  • This is a binary classification problem

In this case, the goal is to predict 'spam' or 'not spam' for email messages. As before (a minimal code sketch follows the steps below):

  1. Choose a class of model
  2. Set model hyperparameters
  3. Configure the data (X and y)
  4. Fit the model to the data
  5. Apply model to new (unseen) data
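A minimal sketch of these five steps with scikit-learn, assuming a tiny hand-made set of example messages (the data, the choice of Naive Bayes and the hyperparameter values are illustrative, not a prescribed setup):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (illustrative only)
messages = [
    "win a free prize now",
    "cheap loans click here",
    "meeting agenda for tomorrow",
    "lunch at noon?",
]
labels = ["spam", "spam", "not spam", "not spam"]  # target vector y

# 1-3. Choose a model class, set hyperparameters, configure the data (X and y)
vectorizer = CountVectorizer()          # turn messages into a feature matrix
X = vectorizer.fit_transform(messages)  # X: document-term counts
model = MultinomialNB(alpha=1.0)        # Naive Bayes with Laplace smoothing

# 4. Fit the model to the data
model.fit(X, labels)

# 5. Apply the model to new (unseen) data
new_messages = ["free prize waiting", "agenda for the meeting"]
X_new = vectorizer.transform(new_messages)
print(model.predict(X_new))
```

In a real spam filter, the feature matrix X would be built from a large labelled corpus and the model would be evaluated on a held-out test set.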

To measure performance, we should consider several factors, including

  • Metric(s)
    • These are quantitative measures that assess how well a model performs. A common metric is accuracy, which is calculated as the number of correct predictions divided by the total number of predictions (n).
  • Balance of the dataset
    • This refers to the distribution of classes within your data. An imbalanced dataset can skew the performance metrics, so it's important to consider this factor as well (for an unbalanced dataset, we can achieve high accuracy simply by selecting the majority class).

Another example of a classification problem:

  • Imagine you work in a hospital
    : Predict whether a CT scan shows a tumour or not
    • Tumours are rare events, so the classes are unbalanced
      : The cost of missing a tumour is much higher than a 'false alarm'
  • Accuracy is not a good metric

In this case, a confusion matrix can be used to compare the predicted values with the actual values (ground truth):

  Outcome               Predicted   Actual
  True Positive (TP)    Positive    Positive
  False Positive (FP)   Positive    Negative
  False Negative (FN)   Negative    Positive
  True Negative (TN)    Negative    Negative

 

  Confusion Matrix      Actual Positive   Actual Negative
  Predicted Positive    TP                FP
  Predicted Negative    FN                TN

 

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Recall = TP / (TP + FN)
    : Recall is the proportion of actual positive values that are predicted positive
  • Precision = TP / (TP + FP)
    : Precision is the proportion of predicted positive values that are actually positive
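As a small worked example, the metrics above can be computed directly from hypothetical confusion matrix counts (the numbers below are invented to illustrate the unbalanced tumour scenario):

```python
# Hypothetical counts from a confusion matrix (illustrative numbers)
tp, fp, fn, tn = 30, 10, 5, 955

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)        # proportion of actual positives predicted positive
precision = tp / (tp + fp)     # proportion of predicted positives that are actually positive

print(f"Accuracy:  {accuracy:.3f}")   # 0.985
print(f"Recall:    {recall:.3f}")     # 0.857
print(f"Precision: {precision:.3f}")  # 0.750
```

Accuracy looks excellent even though 5 of the 35 actual tumours were missed, which is why recall is the more informative metric here.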

 


Natural Language Processing (NLP) is informed by a number of perspectives (disciplines contribute to NLP):

  • Computer/data science
    • Theoretical foundation of computation and practical techniques for implementation
  • Information science
    • Analysis, classification, manipulation, retrieval and dissemination of information
  • Computational Linguistics
    • Use of computational techniques to study linguistic phenomena
  • Cognitive science
    • Study of human information processing (perception, language, reasoning, etc.)

NLP adopts multiple paradigms:

  • Symbolic approaches
    • Rule-based, hand coded (by linguists/subject matter experts)
    • Knowledge-intensive
  • Statistical approaches
    • Distributional & neural approaches, supervised or unsupervised
    • Data-intensive

NLP applications:

  • Text categorisation
    • Media monitoring
      • Classify incoming news stories
    • Search engines
      • Classify query intent, e.g. search for 'LOG313'
    • Spam detection
  • Machine translation
    • Fully automatic, e.g. Google translate
    • Semi-automated
      • Helping human translators
  • Text summarisation
    : to manage information overload, we need to abstract text down to its most important elements, i.e. summarise it
    • Summarisation
      • Single-document vs. multi-document
    • Search results
    • Word processing
    • Research/analysis tools
  • Dialog systems
    • Chatbots
    • Smartphone speakers
    • Smartphone assistants
    • Call handling systems
      • Travel
      • Hospitality
      • Banking
  • Sentiment Analysis
    : identify and extract subjective information
    • Several sub-tasks:
      • Identify polarity
        e.g. of movie reviews
        e.g. positive, negative, or neutral
      • Identify emotional states
        e.g. angry, sad, happy, etc.
      • Subjectivity/objectivity identification
        e.g. distinguishing “fact” from opinion
      • Feature/aspect-based
        : differentiate between specific features or aspects of entities
  • Text mining
    • Analogy with Data Mining
      • Discover or infer new knowledge from unstructured text resources
    • A<->B and B<->C
      • Infer A<->C?
        e.g. link between migraine headaches and magnesium deficiency
    • Applications in life sciences, media/publishing, counter terrorism and competitive intelligence
  • Question answering
    • Going beyond the document retrieval paradigm
      : provide specific answers to specific questions
  • Natural language generation
  • Speech recognition & synthesis

…and lots more

 

History of NLP

  • Foundational Insights: 1940s and 1950s
    • Two foundational paradigms:
      1. The automaton, which is the essential information processing unit
      2. Probabilistic or information-theoretic models
    • The automaton arose out of Turing’s (1936) model of algorithmic computation
      • Chomsky (1956) considered finite state machines as a way to characterise a grammar
        : he was one of the first people to use these ideas
    • Shannon (1948) borrowed the concept of entropy from thermodynamics
      : Entropy is a measure of uncertainty (the higher the entropy, the greater the uncertainty)
      • As a way of measuring the information content of a language
      • Measured the entropy of English using probabilistic techniques
  • Two camps: 1960s and 1970s
    • Speech and language processing split into two paradigms:
      1. Symbolic:
           - Chomsky and others on parsing algorithms
           - Artificial intelligence (1956) work on reasoning and logic
           - Early natural language understanding (NLU) systems:
                 - Single-domain pattern matching
                - Keyword search
                - Heuristics for reasoning
      2. Statistical (stochastic)
           - Mosteller and Wallace (1964) applied Bayesian methods to the problem of authorship attribution of The Federalist Papers
  • Early NLP systems
    : ELIZA and SHRDLU were two highly influential early NLP systems
    • ELIZA
      • Weizenbaum 1966
      • Pattern matching (ELIZA used elementary keyword spotting techniques)
      • First chatbot
    •  SHRDLU
      • Winograd 1972
      • Natural language understanding
      • Comprehensive grammar of English
        Winograd created an imaginary 'blocks world' (a simulated robot embedded in a world of toy blocks); the user could interact with this blocks world by asking questions and giving commands.
    • Further developments in the 1960s
      • First text corpora (corpora is plural of corpus)
        • The Brown corpus: a one-million-word collection of samples from 500 written texts from different genres (newspaper, novels, non-fiction, academic, etc.), assembled at Brown University in 1963-64 (Kučera and Francis, 1967; Francis, 1979; Francis and Kučera, 1982), and William S. Y. Wang’s 1967 DOC (Dictionary on Computer)
  • Empiricism: 1980s and 1990s
    : The rise of the WWW emphasised the need for language-based information retrieval and information extraction.
    • The return of two classes of models that had lost popularity:
      1. Finite-state models:
           - Finite-state morphology by Kaplan and Kay (1981) and models of syntax by Church (1980)
      2. Probabilistic and data-driven approaches:
           - From speech recognition to part-of-speech tagging, parsing and semantics
    • Model evaluation
      • Quantitative metrics, comparison of performance with previously published research
      • Regular competitive evaluation exercises such as the Message Understanding Conferences (MUC)
  • The rise of machine learning: 2000s
    : Large amounts of spoken and written language data became available, including annotated collections
    e.g. Penn Treebank (Marcus et al., 1993)
    • Traditional NLP problems, such as parsing and semantic analysis, became problems for supervised learning
    • Unsupervised statistical approaches began to receive renewed attention
      • Statistical approaches to machine translation (Brown et al., 1990; Och and Ney, 2003) and topic modelling (Blei et al., 2003) demonstrated that effective applications could be constructed from systems trained on unannotated data
      • The cost and difficulty of producing annotated corpora became a limiting factor for supervised approaches
  • Ascendance of deep learning: 2010s onwards
    • Deep learning methods have become pervasive in NLP and AI in general
      • Advances in technology such as GPUs developed for gaming
      • Plummeting costs of memory
      • Wide availability of software platforms
    • Classic ML methods require analysts to select features based on domain knowledge
      • Deep learning introduced automated feature engineering: features are generated by the learning system itself
    • Collobert et al. (2011) applied convolutional neural nets (CNNs) to POS tagging, chunking, named entity recognition and language modelling
      • CNNs are unable to handle long-distance contextual information
    • Recurrent neural networks (RNNs) process items as a sequence with a "memory" of previous inputs
      : This makes them very useful for sequence labelling tasks.
      • Applicable to many tasks such as:
        • Word-level: named entity recognition, language modelling
        • Sentence-level: sentiment analysis, selecting responses to messages
        • Language generation for machine translation, image captioning, etc.

RNNs are supplemented with long short-term memory (LSTM) or gated recurrent units (GRUs) to improve training performance by mitigating the 'vanishing gradient problem'.
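As an illustration of the sequence-labelling setup described above, here is a minimal PyTorch sketch of an LSTM tagger that assigns one label per input token. The vocabulary size, tag set size and layer dimensions are arbitrary placeholders, not a reference implementation.

```python
import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    """Minimal LSTM sequence labeller: one tag score vector per input token."""

    def __init__(self, vocab_size=1000, tagset_size=5, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # The LSTM's gating mechanism mitigates the vanishing gradient problem of plain RNNs
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, tagset_size)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        outputs, _ = self.lstm(embedded)       # (batch, seq_len, hidden_dim)
        return self.classifier(outputs)        # (batch, seq_len, tagset_size)

# One toy batch of token ids (placeholder values)
model = LSTMTagger()
tokens = torch.randint(0, 1000, (1, 6))        # batch of 1 sentence, 6 tokens
tag_scores = model(tokens)
print(tag_scores.shape)                        # torch.Size([1, 6, 5])
```

A GRU-based variant would simply swap nn.LSTM for nn.GRU in the sketch above.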
