Natural Language Processing (NLP) is informed by a number of disciplines:
Computer/data science
Theoretical foundation of computation and practical techniques for implementation
Information science
Analysis, classification, manipulation, retrieval and dissemination of information
Computational Linguistics
Use of computational techniques to study linguistic phenomena
Cognitive science
Study of human information processing (perception, language, reasoning, etc.)
NLP adopts multiple paradigms:
Symbolic approaches
Rule-based, hand coded (by linguists/subject matter experts)
Knowledge-intensive
Statistical approaches
Distributional & neural approaches, supervised or unsupervised
Data-intensive
NLP applications:
Text categorisation
Media monitoring
Classify incoming news stories
Search engines
Classify query intent, e.g. search for 'LOG313'
Spam detection
Machine translation
Fully automatic, e.g. Google translate
Semi-automated
Helping human translators
Text summarisation : to manage information overload, we need to abstract text down to its most important elements, i.e. summarise it
Summarisation
Single-document vs. multi-document
Search results
Word processing
Research/analysis tools
Dialog systems
Chatbots
Smartphone speakers
Smartphone assistants
Call handling systems
Travel
Hospitality
Banking
Sentiment Analysis : identify and extract subjective information
Several sub-tasks:
Identify polarity, e.g. classify movie reviews as positive, negative, or neutral
Identify emotional states, e.g. angry, sad, happy, etc.
Subjectivity/objectivity identification, e.g. separating “fact” from opinion
Feature/aspect-based : differentiate between specific features or aspects of entities
Text mining
Analogy with Data Mining
Discover or infer new knowledge from unstructured text resources
A<->B and B<->C
Infer A<->C? e.g. link between migraine headaches and magnesium deficiency
Applications in life sciences, media/publishing, counter terrorism and competitive intelligence
Question answering
Going beyond the document retrieval paradigm : provide specific answers to specific questions
Natural language generation
Speech recognition & synthesis
…and lots more
History of NLP
Foundational Insights: 1940s and 1950s
Two foundational paradigms: 1. The automaton, which is the essential information processing unit 2. Probabilistic or information-theoretic models
The automaton arose out of Turing’s (1936) model of algorithmic computation
Chomsky (1956) considered finite state machines as a way to characterise a grammar : he was one of the first people to use these ideas
Shannon (1948) borrowed the concept of entropy from thermodynamics : entropy is a measure of uncertainty (the higher the entropy, the greater the uncertainty)
As a way of measuring the information content of a language
Shannon measured the entropy of English using probabilistic techniques
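As a sketch of the entropy idea (the distributions below are illustrative, not Shannon's English-entropy estimates), the Shannon entropy of a discrete distribution is H = -Σ p·log₂p:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin is maximally uncertain: 1 bit per outcome.
print(entropy([0.5, 0.5]))   # 1.0
# A biased coin is more predictable, so it carries less information.
print(entropy([0.9, 0.1]))   # ~0.469
```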
Two camps: 1960s and 1970s
Speech and language processing split into two paradigms: 1. Symbolic: - Chomsky and others on parsing algorithms - Artificial intelligence (1956) work on reasoning and logic - Early natural language understanding (NLU) systems: - Single domains, pattern matching - Keyword search - Heuristics for reasoning 2. Statistical (stochastic): - Mosteller and Wallace (1964) applied Bayesian methods to the problem of authorship attribution of The Federalist Papers
Early NLP systems : ELIZA and SHRDLU were two highly influential early NLP systems
ELIZA
Weizenbaum 1966
Pattern matching (ELIZA used elementary keyword spotting techniques)
First chatbot
SHRDLU
Winograd 1972
Natural language understanding
Comprehensive grammar of English
They created an imaginary “blocks world” (simulating a robot embedded in a world of toy blocks). The user could interact with this blocks world by asking questions and giving commands.
Further developments in the 1960s
First text corpora (corpora is plural of corpus)
The Brown corpus: a one-million-word collection of samples from 500 written texts from different genres (newspaper, novels, non-fiction, academic, etc.), assembled at Brown University in 1963-64 (Kučera and Francis, 1967; Francis, 1979; Francis and Kučera, 1982), and William S. Y. Wang’s 1967 DOC (Dictionary on Computer)
Empiricism: 1980s and 1990s : The rise of the WWW emphasised the need for language-based information retrieval and information extraction.
The return of two classes of models that had lost popularity: 1. Finite-state models: - Finite-state morphology by Kaplan and Kay (1981) and models of syntax by Church (1980) 2. Probabilistic and data-driven approaches: - From speech recognition to part-of-speech tagging, parsing and semantics
Model evaluation
Quantitative metrics, comparison of performance with previous published research
Regular competitive evaluation exercises such as the Message Understanding Conferences (MUC)
The rise of machine learning: 2000s : Large amounts of spoken and written language data became available, including annotated collections e.g. Penn Treebank (Marcus et al. 1993)
Traditional NLP problems, such as parsing and semantic analysis, became problems for supervised learning
Unsupervised statistical approaches began to receive renewed attention
Statistical approaches to machine translation (Brown et al., 1990; Och and Ney, 2003) and topic modelling (Blei et al., 2003) demonstrated that effective applications could be constructed from systems trained on unannotated data
Cost and difficulty of producing annotated corpora became a limiting factor for supervised approaches
Ascendance of deep learning: 2010s onwards
Deep learning methods have become pervasive in NLP and AI in general
Advances in technology such as GPUs developed for gaming
Plummeting costs of memory
Wide availability of software platforms
Classic ML methods require analysts to select features based on domain knowledge
Deep learning introduced automated feature engineering: generated by the learning system itself
Collobert et al (2011) applied convolutional neural nets (CNNs) to POS tagging, chunking, NE tags and language modelling
CNNs unable to handle long-distance contextual information
Recurrent neural networks (RNNs) process items as a sequence with a "memory" of previous inputs : very useful for what we call sequence labelling tasks.
Applicable to many tasks such as:
Word-level: named entity recognition, language modelling
Sentence-level: sentiment analysis, selecting responses to messages
Language generation for machine translation, image captioning, etc.
RNNs are supplemented with long short-term memory (LSTM) or gated recurrent units (GRUs) to improve training performance by mitigating the 'vanishing gradient problem'.
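The recurrence idea can be sketched as a single scalar update, where each new hidden state mixes the current input with the previous state; the weights here are made-up illustrative values, not a trained model:

```python
import math

def rnn_step(x_t, h_prev, w_x=1.0, w_h=0.5, b=0.0):
    """One recurrent step: the new hidden state combines the current
    input with the previous state, giving the network its 'memory'."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

# Process a short sequence: each state depends on all earlier inputs.
h = 0.0
for x in [1.0, -0.5, 0.2]:
    h = rnn_step(x, h)
print(h)
```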
Deep neural networks do the input-to-target mapping via a deep sequence of simple data transformations (layers). The transformation implemented by a layer is parameterised by its weights. Weights are also sometimes called the parameters of a layer.
Learning means finding a set of values for the weights of all layers in a network.
The network will correctly map the inputs to their associated targets only if the weights are reasonable.
To control the output of a neural network, we need to be able to measure how far this output is from what we expected. This is the job of the loss function of the network. The loss function is also sometimes called the objective function or cost function.
The loss function takes the predictions of the network and the true target and computes a distance score, capturing how well the network has done.
The fundamental trick in deep learning is to use this score as a feedback signal to adjust the value of the weights a little, in a direction that will lower the loss score. This adjustment is the job of the optimiser, which implements what's called the backpropagation algorithm, which is the central algorithm in deep learning.
With every example the network processes, the weights are adjusted a little in the correct direction, and the loss score decreases. This is the training loop.
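The training loop just described can be sketched end to end for the simplest possible model, a one-weight linear map fitted with a squared-error loss; the data, learning rate, and epoch count below are illustrative assumptions, not from the text:

```python
# Fit y = w * x to data with a squared-error loss by gradient descent:
# the loss score is the feedback signal used to nudge the weight.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # true mapping: y = 2x

w = 0.0    # initial weight
lr = 0.05  # learning rate

for epoch in range(200):  # the training loop
    for x, y in data:
        pred = w * x                # forward pass
        grad = 2 * (pred - y) * x   # dLoss/dw for loss = (pred - y)**2
        w -= lr * grad              # optimiser step: lower the loss

print(round(w, 3))  # w converges towards 2.0
```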
When analysing the performance of Hebb learning for a linear associative memory, we can distinguish two cases for the prototype vectors p_q: the orthonormal case (orthogonal and of unit length) and the case where the vectors have unit length but are not orthogonal.
Orthonormal case: if p_k is the network input, the network output is computed as $$ a=Wp_k=\left(\sum_{q=1}^{Q}t_qp_q^T\right)p_k=\sum_{q=1}^{Q}t_q(p_q^Tp_k) $$ Because the p_q are orthonormal, $$ p_q^Tp_k=1 \quad (q=k), \qquad p_q^Tp_k=0 \quad (q\neq k) $$ so the sum collapses to the single term q=k and the network output simplifies to $$ a=Wp_k=t_k $$
Unit length but not orthogonal: because the vectors are not orthogonal, the network will not produce the exact output, and the magnitude of the error depends on the amount of correlation between the prototype input patterns: $$ a=Wp_k=t_k+\sum_{q\neq k}t_q(p_q^Tp_k) $$ The sum on the right-hand side represents the error.
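A minimal sketch of the Hebb-rule associator, W = Σ t_q p_qᵀ, using small made-up orthonormal prototypes to show that recall is exact in the orthonormal case:

```python
def outer(t, p):
    """Outer product t p^T as a list-of-rows matrix."""
    return [[ti * pj for pj in p] for ti in t]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_vec(W, p):
    return [sum(w * x for w, x in zip(row, p)) for row in W]

# Orthonormal prototype patterns and their targets (illustrative values).
p1, t1 = [1.0, 0.0], [1.0, -1.0]
p2, t2 = [0.0, 1.0], [-1.0, 1.0]

# Hebb rule: W = t1 p1^T + t2 p2^T
W = mat_add(outer(t1, p1), outer(t2, p2))

print(mat_vec(W, p1))  # recalls t1 exactly: [1.0, -1.0]
print(mat_vec(W, p2))  # recalls t2 exactly: [-1.0, 1.0]
```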
For example, for a single-input neuron with weight = 3, input p = 2 and bias b = -1.5, the output is a = f(3 × 2 - 1.5) = f(4.5).
Multiple-input neuron
A neuron with two or more inputs
a = f(Wp + b)
Weight: the weight matrix W
Each individual input p_1, p_2, … has its own corresponding weight w
When $$ W=\begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,R} \end{bmatrix}, \quad p=\begin{bmatrix} p_1 & p_2 & \cdots & p_R \end{bmatrix}^T $$
the net input can be expressed as $$ n=w_{1,1}p_1+w_{1,2}p_2+\cdots+w_{1,R}p_R+b $$
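The net-input formula can be sketched directly; the parameter values below are illustrative, and the single-input call mirrors the f(3 × 2 - 1.5) = f(4.5) example earlier:

```python
import math

def neuron(weights, inputs, bias, f=math.tanh):
    """Multiple-input neuron: a = f(n) with net input n = Wp + b."""
    n = sum(w * p for w, p in zip(weights, inputs)) + bias
    return f(n)

# Single-input case with an identity transfer function: reproduces f(4.5).
a_single = neuron([3.0], [2.0], -1.5, f=lambda n: n)
print(a_single)  # 4.5

# Two-input case with made-up weights and a tansig (tanh) transfer function.
a = neuron([3.0, 2.0], [2.0, -2.5], -1.5)
```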
Layers
Output layer
Hidden layer
e.g. a network whose first and second layers are hidden layers and whose third layer is the output layer
Recurrent layer
A recurrent network is a network with feedback; it is more powerful than a feedforward network and can exhibit temporal behaviour.
Transfer function types
1. Hard Limit
2. Symmetrical Hard Limit
3. Linear
4. Saturating Linear
5. Symmetric Saturating Linear
6. Log-Sigmoid
7. Hyperbolic Tangent Sigmoid
8. Positive Linear
9. Competitive
For a two-input neuron with parameters $$ b=1.2, W=\begin{bmatrix} 3 & 2 \\ \end{bmatrix}, p=\begin{bmatrix} -5 & 6 \\ \end{bmatrix}^T $$ the net input is $$ n=Wp+b=\begin{bmatrix} 3 & 2 \\ \end{bmatrix}\begin{bmatrix} -5 \\ 6 \end{bmatrix}+(1.2)=-1.8 $$ and the neuron output can then be obtained as, for example:
Symmetrical hard limit transfer function: $$ a=hardlims(-1.8)=-1 $$
Saturating linear transfer function: $$ a=satlin(-1.8)=0 $$
Hyperbolic tangent sigmoid transfer function: $$ a=tansig(-1.8)=-0.9468 $$
and so on.
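The three transfer-function evaluations above can be reproduced directly; hardlims, satlin and tansig are defined here by hand following their standard definitions:

```python
import math

def hardlims(n):
    """Symmetrical hard limit: +1 for n >= 0, otherwise -1."""
    return 1 if n >= 0 else -1

def satlin(n):
    """Saturating linear: clamp the net input to the range [0, 1]."""
    return min(max(n, 0.0), 1.0)

def tansig(n):
    """Hyperbolic tangent sigmoid."""
    return math.tanh(n)

n = -1.8  # net input from the worked example
print(hardlims(n))          # -1
print(satlin(n))            # 0.0
print(round(tansig(n), 4))  # -0.9468
```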
Binary pattern recognition with a feedforward network has the problem that accuracy can suffer in ambiguous cases, where classification depends on a decision boundary. The Hamming network and the Hopfield network were developed to address this.
The Hamming network uses both a feedforward layer and a recurrent layer.
The feedforward layer adds a bias vector to the degree of similarity (inner product) between the prototype vectors and the input vector, guaranteeing that the result is never negative so that the recurrent layer operates correctly. The recurrent layer is a competitive layer that repeatedly multiplies the feedforward layer's output by a weight matrix, which takes the form shown below.
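A sketch of the recurrent competitive layer just described, assuming the standard Hamming-network form of the recurrent weight matrix (1 on the diagonal, a small -ε elsewhere, with ε < 1/(S-1)); the starting activations and ε are illustrative:

```python
def poslin(n):
    """Positive linear transfer function: clip negative values to zero."""
    return max(n, 0.0)

def compete(a, eps=0.5, max_iter=100):
    """Recurrent competitive layer: each neuron reinforces itself
    (diagonal weight 1) and inhibits the others (off-diagonal -eps),
    repeating until only the strongest neuron remains active."""
    for _ in range(max_iter):
        if sum(1 for x in a if x > 0) <= 1:
            break  # competition has converged to a single winner
        a = [poslin(a[i] - eps * sum(a[j] for j in range(len(a)) if j != i))
             for i in range(len(a))]
    return a

# Illustrative feedforward-layer output: neuron 0 matched the input best.
winner = compete([0.75, 0.25])
print(winner)  # [0.625, 0.0] — neuron 0 wins, neuron 1 is driven to zero
```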