Natural Language Processing (NLP) is informed by a number of perspectives, with several disciplines contributing to it:

  • Computer/data science
    • Theoretical foundation of computation and practical techniques for implementation
  • Information science
    • Analysis, classification, manipulation, retrieval and dissemination of information
  • Computational Linguistics
    • Use of computational techniques to study linguistic phenomena
  • Cognitive science
    • Study of human information processing (perception, language, reasoning, etc.)

NLP adopts multiple paradigms:

  • Symbolic approaches
    • Rule-based, hand coded (by linguists/subject matter experts)
    • Knowledge-intensive
  • Statistical approaches
    • Distributional & neural approaches, supervised or unsupervised
    • Data-intensive

NLP applications:

  • Text categorisation
    • Media monitoring
      • Classify incoming news stories
    • Search engines
      • Classify query intent, e.g. search for 'LOG313'
    • Spam detection
  • Machine translation
    • Fully automatic, e.g. Google Translate
    • Semi-automated
      • Helping human translators
  • Text summarisation
    : to manage information overload, we need to abstract information down to its most important elements, i.e. summarise it
    • Summarisation
      • Single-document vs. multi-document
    • Search results
    • Word processing
    • Research/analysis tools
  • Dialog systems
    • Chatbots
    • Smart speakers
    • Smartphone assistants
    • Call handling systems
      • Travel
      • Hospitality
      • Banking
  • Sentiment Analysis
    : identify and extract subjective information
    • Several sub-tasks:
      • Identify polarity
        e.g. of movie reviews
        e.g. positive, negative, or neutral
      • Identify emotional states
        e.g. angry, sad, happy, etc.
      • Subjectivity/objectivity identification
        e.g. “fact” from opinion
      • Feature/aspect-based
        : differentiate between specific features or aspects of entities
  • Text mining
    • Analogy with Data Mining
      • Discover or infer new knowledge from unstructured text resources
    • A<->B and B<->C
      • Infer A<->C?
        e.g. link between migraine headaches and magnesium deficiency
    • Applications in life sciences, media/publishing, counter terrorism and competitive intelligence
  • Question answering
    • Going beyond the document retrieval paradigm
      : provide specific answers to specific questions
  • Natural language generation
  • Speech recognition & synthesis

…and lots more

 

History of NLP

  • Foundational Insights: 1940s and 1950s
    • Two foundational paradigms:
      1. The automaton, which is the essential information processing unit
      2. Probabilistic or information-theoretic models
    • The automaton arose out of Turing’s (1936) model of algorithmic computation
      • Chomsky (1956) considered finite state machines as a way to characterise a grammar
        : he was one of the first people to use these ideas
    • Shannon (1948) borrowed the concept of entropy from thermodynamics
      : Entropy is a measure of uncertainty: the higher the entropy, the greater the uncertainty
      • As a way of measuring the information content of a language
      • Measured the entropy of English using probabilistic techniques
  • Two camps: 1960s and 1970s
    • Speech and language processing split into two paradigms:
      1. Symbolic:
           - Chomsky and others on parsing algorithms
           - Artificial intelligence (1956) work on reasoning and logic
           - Early natural language understanding (NLU) systems:
                 - Single-domain pattern matching
                - Keyword search
                - Heuristics for reasoning
      2. Statistical (stochastic)
           - Mosteller and Wallace (1964) applied Bayesian methods to the problem of authorship attribution on The Federalist Papers
  • Early NLP systems
    : ELIZA and SHRDLU were the highly influential early NLP systems
    • ELIZA
      • Weizenbaum 1966
      • Pattern matching (ELIZA used elementary keyword spotting techniques)
      • First chatbot
    •  SHRDLU
      • Winograd 1972
      • Natural language understanding
      • Comprehensive grammar of English
        Winograd created an imaginary "blocks world" that simulated a robot embedded in a world of toy blocks. The user could interact with this blocks world by asking questions and giving commands.
    • Further developments in the 1960s
      • First text corpora (corpora is plural of corpus)
        • The Brown corpus: a one-million-word collection of samples from 500 written texts from different genres (newspaper, novels, non-fiction, academic, etc.), assembled at Brown University in 1963-64 (Kučera and Francis, 1967; Francis, 1979; Francis and Kučera, 1982), and William S. Y. Wang’s 1967 DOC (Dictionary on Computer)
    • Empiricism: 1980s and 1990s
      : The rise of the WWW emphasised the need for language-based information retrieval and information extraction.
      • The return of two classes of models that had lost popularity:
        1. Finite-state models:
             - Finite-state morphology by Kaplan and Kay (1981) and models of syntax by Church (1980)
        2. Probabilistic and data-driven approaches:
             - From speech recognition to part-of-speech tagging, parsing and semantics
      • Model evaluation
        • Quantitative metrics, comparison of performance with previous published research
        • Regular competitive evaluation exercises such as the Message Understanding Conferences (MUC)
    • The rise of machine learning: 2000s
      : Large amounts of spoken and written language data became available, including annotated collections
      e.g. Penn Treebank (Marcus et al. 1993)
      • Traditional NLP problems, such as parsing and semantic analysis, became problems for supervised learning
      • Unsupervised statistical approaches began to receive renewed attention
        • Statistical approaches to machine translation (Brown et al., 1990; Och and Ney, 2003) and topic modelling (Blei et al., 2003) demonstrated that effective applications could be constructed from systems trained on unannotated data
        • Cost and difficulty of producing annotated corpora became a limiting factor for supervised approaches
    • Ascendance of deep learning: 2010s onwards
      • Deep learning methods have become pervasive in NLP and AI in general
        • Advances in technology such as GPUs developed for gaming
        • Plummeting costs of memory
        • Wide availability of software platforms
      • Classic ML methods require analysts to select features based on domain knowledge
        • Deep learning introduced automated feature engineering: generated by the learning system itself
      • Collobert et al. (2011) applied convolutional neural networks (CNNs) to POS tagging, chunking, NE tagging and language modelling
        • CNNs unable to handle long-distance contextual information
      • Recurrent neural networks (RNNs) process items as a sequence with a "memory" of previous inputs
        : The method is very useful for what we call sequence labelling tasks.
        • Applicable to many tasks such as:
          • Word-level: named entity recognition, language modelling
          • Sentence-level: sentiment analysis, selecting responses to messages
          • Language generation for machine translation, image captioning, etc.

RNNs are supplemented with long short-term memory (LSTM) or gated recurrent units (GRUs) to improve training performance (the 'vanishing gradient problem').
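A minimal sketch of such a sequence labeller, assuming PyTorch (the vocabulary size, tag set and layer sizes below are illustrative only, not a specific published architecture):

```python
import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    """Toy RNN sequence labeller: one tag score vector per input token."""
    def __init__(self, vocab_size=10_000, tagset_size=10,
                 embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # LSTM cells carry a "memory" of previous inputs and mitigate
        # the vanishing gradient problem of plain RNNs
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tagset_size)

    def forward(self, token_ids):          # (batch, seq_len)
        x = self.embed(token_ids)          # (batch, seq_len, embed_dim)
        h, _ = self.lstm(x)                # one hidden state per token
        return self.out(h)                 # (batch, seq_len, tagset_size)

tagger = LSTMTagger()
scores = tagger(torch.randint(0, 10_000, (2, 5)))   # two 5-token sentences
```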


In the field of reinforcement learning, the action policy is a mapping between states and actions, denoted by the Greek letter 'π' (pi). This means that the policy, given state s, will recommend taking action a: π(s) → a.

  • State: s
  • Action: a
  • Next state: s'

A Markov Decision Process is a way of formalising a stochastic sequential decision problem.

  • State transitions: P(s' | s, a)
  • Reward function: R(s, a, s')

Formalising means expressing something in a clear, mathematical way so that it can be used to build algorithms. Stochastic means it has a probabilistic element: from the current state, the process may move to different next states probabilistically.
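A toy sketch of how these pieces could be represented in code (the state/action names and numbers are purely illustrative):

```python
# Transition probabilities P(s' | s, a): from state s, taking action a,
# the environment moves to s' with the given probability.
P = {
    ("s0", "a0"): {"s0": 0.2, "s1": 0.8},   # stochastic outcome
    ("s0", "a1"): {"s1": 1.0},
    ("s1", "a0"): {"s0": 1.0},
}

# Reward function R(s, a, s')
R = {
    ("s0", "a0", "s1"): 1.0,
    ("s0", "a0", "s0"): 0.0,
    ("s0", "a1", "s1"): 0.5,
    ("s1", "a0", "s0"): 0.0,
}

def policy(state):
    """An action policy pi(s) -> a (here a fixed, deterministic mapping)."""
    return {"s0": "a0", "s1": "a0"}[state]
```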

 

Unlike a one-time decision, an optimal policy in reinforcement learning considers the long-term reward. It aims to maximise the total reward accumulated over a sequence of actions, even if some rewards come much later.

 

Bellman equation(ish) defining the value of a given action in a given state based on future reward.

The value of action a in state s is the immediate reward plus the maximum possible future reward over states at later times t, increasingly discounted by the discount factor gamma (γ) raised to the power of t (see the equation sketched below).

  • Gamma (γ) is less than 1.
    • Assuming γ=0.5, then γ squared is 0.25, so we only count a quarter of a reward that arrives two steps in the future. We are not confident about what will happen in the future, so we only take a fraction of the rewards that come later.

Note that the probabilities of state transitions are not included here.
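Written out (a sketch of the deterministic case described above, without transition probabilities), the recursion is

$$ Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a') $$

Unrolling it gives the immediate reward plus future rewards discounted by increasing powers of γ (γ, γ², γ³, ...).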

 

What are the future states and rewards?

  • The state transition matrix describes how the environment reacts to the chosen actions (how the state will change over time based on the chosen actions). It tells us the probability of reaching different states after taking specific actions in the current state.
  • The action policy, on the other hand, guides the decision-making. It takes the current state as input and recommends which action to take. This recommendation can be based on maximising immediate reward, long-term reward, or other criteria depending on the specific policy.

In reinforcement learning, creating an optimal action policy often requires complete knowledge of the environment. This includes knowing the transition matrix (all possible state transitions based on actions) and the rewards associated with each transition.

 

However, in most real-world scenarios, this information is incomplete. Q-learning is a technique that addresses this challenge. It focuses on learning a Q-value function, which estimates the expected future reward for taking a specific action in a particular state.

  • The goal of Q-learning is to find the optimal Q-function (Q*), which tells us the best action to take in any given state to maximise future rewards.

There are various ways to do Q-learning, but most of them do not scale to real problems. The approach used here is to approximate the value function with a deep network, called a Deep Q-Network (DQN).

 

DQN agent architecture

  • An agent is an entity that can observe and act autonomously.
    • We need an agent architecture that solves two problems: no state transition matrix and no action policy.

We explore the game and make observations of the form: s, a, s', r, and done.

  • s = state now
  • a = action taken
  • s' = next state
  • r = reward
  • done = true/false: is the game finished?

For DQN, these observations are stored in the 'replay buffer' (sketched below). Over time, the agent fills up a large replay buffer. For one state, s1, and three actions, a1/a2/a3, example entries are:

  1. s1, a1 → s2, r1, d0
  2. s1, a2 → s3, r2, d0
  3. s1, a3 → s4, r3, d1
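A minimal replay buffer sketch in Python (the class name and capacity are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, s', r, done) observations and returns uniformly
    sampled mini-batches for training."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest observations drop off

    def add(self, s, a, s_next, r, done):
        self.buffer.append((s, a, s_next, r, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)
```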

Epsilon-greedy exploration is the method the agent uses to fill up the replay buffer: a simple and effective acting policy for balancing exploration and exploitation of the estimated rewards (see the sketch below).

  • Epsilon-greedy works by introducing a probability (epsilon, ε) of taking a random action instead of the one with the highest estimated Q-value. This encourages exploration and helps the agent discover potentially better actions it might not have encountered yet.
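A sketch of epsilon-greedy action selection (the epsilon value is just an example):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Return a random action with probability epsilon (explore),
    otherwise the action with the highest estimated Q-value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```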

The DQN agent

  • Knows about the states and rewards
  • Acts in the game world by taking actions
  • Makes observations of what is happening in the game
  • Replay buffer consists of many observations of the actions taken by the agent in the game world and the results of those actions

In Deep Q-Networks (DQN), a crucial part of the training process is the loss function. This function helps the network learn by measuring the difference between its predictions and the desired outcome.

  • Theta (θ) denotes the weights of the network.
  • θ⁻ (theta with a bar) denotes an older copy of the network (the target network).

To train the network, we use a technique called experience replay. We store past experiences (state, action, reward, next state) in a replay buffer (D). During training, we uniformly sample a mini-batch of these experiences, denoted by U(D), to create a training set. This training set feeds the network and helps it learn from various past experiences.
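Putting this together, the loss is usually written as follows, where θ are the current network weights, θ⁻ the weights of the older (target) network, and (s, a, r, s') ∼ U(D) a uniformly sampled mini-batch from the replay buffer D:

$$ L(\theta) = \mathbb{E}_{(s,a,r,s') \sim U(D)} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right] $$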


Analogue to Digital Converter (ADC)

  1. The microphone (transducer) converts air pressure changes into an electrical signal.
  2. The electrical signal generated by a microphone is usually quite small, so we need a device called a preamplifier to boost this weak signal to a level large enough to be digitised.
  3. ADC samples incoming analogue voltage at a specific rate and assigns a digital value to each sample. These digital values are then usable by the digital devices.

The act of assigning an amplitude value to the sample is called quantising and the number of amplitude values available to the ADC is called the sample resolution.

 

Once the audio has entered the digital domain, the possibilities for editing, processing, and mixing are nearly endless. When digital audio is played back, the signal is first sent through a DAC.

 

Digital to Analogue Converter (DAC)

In the opposite case,

  1. DAC converts the digital signal back into an analogue electrical signal.
  2. An amplifier amplifies the level of the signal and sends this signal to a speaker or headphones that will generate the sound wave.

We can perceive the sound wave as a sound. In the context of digital audio playback, the DAC is built into the audio output of the computer or into an audio interface. Some computer speakers connect directly to the computer via USB and therefore have DACs built into them.

 

Audio recording path summary

  1. Vibrations in the air are converted to an analogue electrical signal by a microphone.
  2. The microphone signal is increased by a preamplifier.
  3. The preamplifier signal is converted to a digital signal by an ADC.
  4. The digital signal is stored, edited, processed, mixed, and mastered in software.
  5. The digital signal is played back and converted to an analogue electrical signal by a DAC.
  6. The analogue electrical signal is made larger by an amplifier.
  7. The output of the amplifier is converted into vibrations in the air by a loudspeaker.

 

Sampling rate (frequency)

  • Each measurement of the waveform's amplitude is called a sample.
  • The number of measurements (samples) taken per second is called the sampling rate (Hz).

The faster we sample, the better the quality; but the more samples we take, the more memory we need.

 

The Nyquist-Shannon sampling theorem

The Nyquist theorem defines the minimum sample rate for the highest frequency that we want to measure. The Nyquist frequency, also called the Nyquist limit, is the sample rate divided by two.

  • This theorem says that frequencies in the signal above the Nyquist frequency are not recorded properly by ADCs and introduce artificial frequencies, in a process called aliasing. If the Nyquist theorem is not obeyed, higher-frequency information is recorded at too low a sample rate, resulting in aliasing artefacts.
  • The sampling rate must therefore be at least twice the highest frequency we want to capture in the signal being sampled.

An anti-aliasing filter is a low-pass filter that eliminates frequencies above the Nyquist frequency before audio reaches the ADC.
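A small sketch (assuming NumPy) of what goes wrong without such a filter: a tone above the Nyquist frequency "folds back", and its samples become indistinguishable from those of a lower, aliased frequency.

```python
import numpy as np

fs = 44_100               # sampling rate (Hz); Nyquist frequency = fs / 2 = 22,050 Hz
f_true = 30_000           # tone above the Nyquist limit
n = np.arange(441)        # 10 ms worth of sample indices

samples = np.sin(2 * np.pi * f_true * n / fs)

# The tone folds back to |f_true - fs| = 14,100 Hz:
f_alias = abs(f_true - fs * round(f_true / fs))
alias = np.sin(2 * np.pi * f_alias * n / fs)
print(f_alias)                          # 14100
print(np.allclose(samples, -alias))     # True (same samples, opposite phase)
```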

 

Bit depth

  • Bit depth, also known as sample width and quantisation level, is the number of bits used to record the amplitude measurements.
  • The more bits we use, the more accurately we can measure the analogue waveform and the more hard disk space or memory size we need.

Common bit widths used for digital sound representation are 8, 16, 24, and 32 bits.


For example, what is the approximate size of an uncompressed stereo audio file that is one minute long, at a sampling frequency of 44.1 kHz and a resolution of 16 bits? The answer is as follows:

44,100 samples/second * 16 bits * 60 seconds * 2 channels = 84,672,000 bits = 10,584,000 bytes ≈ 10.584 MB
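The same arithmetic as a quick check in Python:

```python
sampling_rate = 44_100        # samples per second
bit_depth = 16                # bits per sample
seconds = 60
channels = 2                  # stereo

bits = sampling_rate * bit_depth * seconds * channels
print(bits)                   # 84672000 bits
print(bits / 8 / 1_000_000)   # 10.584 MB (decimal megabytes)
```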

 

Clipping

Clipping occurs in an ADC when the analog input signal exceeds the converter's maximum capacity. This overload forces the ADC to assign either the maximum or minimum digital value to affected samples, resulting in a flat-topped or flat-bottomed waveform. This distortion is undesirable and should be avoided. If the level meter reads zero (or the clipping indicator turns red), this means the signal is clipping!

 

Digital audio representation

All these processes generate an array of samples that we can use to create a new file, to process the audio in real time on the computer, to store the data on a CD, etc. There are two ways of representing digital audio:

1. The time domain representation gives the amplitude of the signal at the instance of time during which it was sampled.

  • Time can be expressed in seconds (decimal format) or in terms of sample numbers; in the graph it is shown in seconds.
  • Amplitude is normalised to values between -1 and 1, although some programs display it in decibels or in raw sample values.

We can use decibels to represent the values of the samples, but that is not the same as dB SPL. dB FS stands for decibels Full Scale.

  • For example, in Audacity, the meters are in decibels and go from zero down to minus infinity.
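For a normalised sample value, dB FS can be computed with a small helper like this (a sketch; full scale, i.e. a value of 1.0, corresponds to 0 dB FS):

```python
import math

def dbfs(sample):
    """Convert a normalised sample value (-1.0 to 1.0) to decibels full scale."""
    return 20 * math.log10(abs(sample)) if sample != 0 else float("-inf")

print(dbfs(1.0))    # 0.0    (full scale)
print(dbfs(0.5))    # about -6.0
print(dbfs(0.0))    # -inf   (silence)
```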

2. The frequency domain representation gives us information about the frequencies of a sound (sounds are usually composed of many frequencies, not just one).

A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time.

A spectrogram is very similar to the frequency domain representation, but it provides more information about the time-varying nature of vibration, while frequency domain analyses provide information at a specific moment or as an average over time.


What is SOUND?

  • In terms of physics, sound is a form of energy produced by vibrating matter.
    • Sound is mechanical energy that needs a medium to propagate.
    • Sound can travel through a medium which is solid, liquid or gas.
  • In terms of physiology and psychology, sound is the reception of these sound waves and their processing by the brain.
    • Sound waves arrive through the receiver's ears.

A sound wave is generated by some vibrating source, propagates through a medium as a series of compressions and rarefactions, and is finally received by our ears and brain.

 

Characteristics of sound waves

  • Velocity
    • The speed of sound waves is NOT always the same.
    • The speed of sound depends on the elasticity, density, and temperature of the medium the sound is travelling through.
      • For example, at 0 degrees centigrade, the speed of sound in air is about 331 m/s (metres per second).
      • Note that sound generally travels faster through solids than through liquids, and faster through liquids than through gases, because stiffer (more elastic) media transmit sound more quickly.
        In air, the speed of sound increases with temperature, so sound travels FASTER at 30 degrees than at 0 degrees.
  • Wavelength and Amplitude
    • What happens to the air molecules when making sounds? The vibration produces areas in which the particles are closer together and areas in which they are further apart. 
      • The vibration creates in the surrounding air a series of alternating high-pressure regions called compressions (regions where the air molecules have been pushed together) and low-pressure regions called rarefactions (decompressions), which travel away from the source at a certain speed.
      • The air molecules vibrate back and forth, but they do not travel with the wave. Sound waves transfer energy but not matter: the wave energy travels in the direction of propagation, while the matter does not.
    • Sound waves can be represented as a function which ranges over particle density or pressure values across the domain of distance.
      • The wavelength of a sound wave is the distance between two successive crests (or troughs) of the wave.
      • The amplitude of a sound wave is the maximum change in pressure or density that the vibrating object produces in the surrounding air.
      • Pressure is measured in pascals (Pa), although for practical reasons the dB SPL scale is usually used for measuring sound amplitude.
  • Frequency and Time Period
    • Frequency is the number of times per second that a sound pressure wave repeats itself. These repetitions are known as cycles. Frequency is measured in hertz (Hz) or cycles per second.
      • The diagram representing the upper wave contains more cycles per unit of time.
    • The time period is the duration of one cycle: the time a sound wave takes to go through a compression-rarefaction cycle.
    • Formulas:
      • The period (T) is the inverse of the frequency (f): T = 1/f
        As the period gets smaller, the frequency gets larger, and as the period gets larger, the frequency gets smaller.
      • There is also a direct relation between the speed of sound (v), wavelength (λ) and frequency (f): v = λ × f

 

 

In order to determine the properties for a given sound, it is useful to use the waveform view of sound. The waveform view is a graph of the change in air pressure at a particular location over time due to a compression wave. The waveform view is a physical representation.

 

Human sound perception

Physical property     Perceptual property
Frequency             Pitch
Amplitude             Loudness
Waveform              Timbre
Wavelength
Time period
Duration

What is the relation between the physical properties of sound and its psychological (or perceptual) properties, such as pitch, loudness, and timbre?

  • Pitch is the quality that makes it possible to classify sounds as higher or lower.
    • The physical property that is related to pitch is frequency.
  • Loudness is the quality that makes it possible to order sounds on a scale from quiet to loud.
    • The amplitude of sound waves is related to the perception of loudness.
  • Timbre, also known as tone colour or tone quality, describes those characteristics of sound which allow the ear to distinguish sounds which have the same pitch and loudness.
    • The waveform is related to the perception of timbre.

In order to use the physical waveform view to understand something about these perceptual properties, we need to identify physical properties that are related to them. However, this is not so simple!

  1. First, the relationship between the physical properties of a sound wave and the way we perceive it is non-linear. For example, a constant change in frequency does not always correspond to a constant change in pitch.
  2. Second, the way all these properties relate to each other is not simple. For example, frequency is related to pitch, but frequency also affects loudness and timbre; amplitude affects pitch; the waveform affects pitch; and duration affects both pitch and timbre. In fact, all these properties are related to each other.

The basic concept for understanding processes such as the digitisation of a sound wave or the compression of a sound file:

  • Pure tone
    : Several experiments on human sound perception are based on pure tones. Real-world sounds are not pure tones; pure tones can only be produced technologically (see the sketch below).
     A pure tone is a sound that can be represented by a sinusoidal waveform: a sine wave of constant frequency, phase, and amplitude. It is composed of a single frequency.
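A minimal sketch (assuming NumPy) of producing a pure tone as a sampled sine wave of a single frequency:

```python
import numpy as np

fs = 44_100                              # sampling rate (Hz)
freq, amp, duration = 440.0, 0.5, 1.0    # a 440 Hz pure tone, half amplitude, 1 s
t = np.arange(int(fs * duration)) / fs
tone = amp * np.sin(2 * np.pi * freq * t)   # constant frequency, phase and amplitude
```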

Perception of pitch

Frequency is perceived by humans as pitch.

  • A high frequency sound wave corresponds to a high pitch sound.
  • A low frequency sound wave corresponds to a low pitch sound.

As described in the figure above, the relationship between pitch and frequency is not a simple linear one: for frequencies above 1,000 hertz, a greater change in frequency is needed to produce a corresponding change in pitch.

  • Although a wide range of frequencies occurs in the world, humans cannot hear all the sound waves that arrive at our ears. The frequency range of human hearing is about 20 to 20,000 hertz.

Perception of loudness

Loudness is a sensation related to the amplitude of sound waves.


To express sound amplitude in terms of pascals, we have to deal with numbers ranging from as small as 20 micropascals to as large as 20 million micropascals.

Our ears perceive sound intensity on a logarithmic scale, which is why sound pressure is measured in decibels (dB), specifically dB SPL (decibels of sound pressure level). This logarithmic scale makes more sense for our hearing than a linear one. The formula below expresses sound pressure in dB SPL; as an example, take a sound pressure of 20,000 micropascals.

$$ SPL=20\log_{10}\left(\frac{20{,}000}{20}\right)\,dB=20\times3=60 $$

Dividing this by a reference pressure (usually 20 micropascals) gives us 1,000. Taking the logarithm (base 10) of 1,000 and multiplying by 20 (because we're using the dB scale) gives us approximately 60 dB SPL.
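The same conversion as a small helper function (the 20 micropascal reference is the one mentioned above):

```python
import math

def db_spl(pressure_micropascals, reference=20.0):
    """Convert sound pressure in micropascals to dB SPL."""
    return 20 * math.log10(pressure_micropascals / reference)

print(db_spl(20_000))        # 60.0   (the worked example above)
print(db_spl(20))            # 0.0    (threshold of hearing)
print(db_spl(20_000_000))    # 120.0  (threshold of pain)
```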

The relation between the subjective quality of loudness and the physical quantity of sound pressure level is complex. This graph is called an equal loudness contour and it shows the sound pressure level required at different frequencies to achieve a consistent perceived loudness. Each curve on the chart represents a curve of equal loudness of pure tones. Two important things are:

  1. The ear is more sensitive to high-mid frequencies than to bass frequencies. In general, humans can hear sounds at lower decibel levels between 3000 and 5000 hertz than any other frequency.
  2. The human ear interprets changes in loudness on a logarithmic scale.

The quietest sound we can possibly hear is given as 0 dB SPL and is referred to as the threshold of hearing. The "0" does not mean that there is no pressure in the sound wave. The loudest sound that we can hear is approximately 120 dB SPL and is referred to as the threshold of pain. Anything above this is both physically painful and damaging to our hearing.

 

Perception of timbre

Timbre or tone quality is what differentiates two sounds of the same frequency and amplitude.

  • The two sound graphs have the same frequency and amplitude, yet they differ. They have different timbre!

The perceptual property of timbre is related to the physical properties of the waveform and the spectrum of sound. Timbre is influenced by the shape of the waveform as well as the spectral characteristics. For instance, the spectrum of a pure tone contributes to its timbral qualities.

Other waveforms can be similar yet different: they may have the same amplitude and the same frequency, but they sound different because they have a different tone quality, or timbre.

 

