A Fourier Transform (FT) is an integral transform that takes a function as input and outputs another function that describes the extent to which various frequencies are present in the original function.

 

Discrete Fourier Transform (DFT)

  • Since the real world deals with discrete data (samples), the DFT is a crucial tool. It's the discrete version of the FT, specifically designed to analyse finite sequences of data points like those captured by computers.
  • The DFT transforms these samples into a same-length sequence of complex frequency-domain values, which are samples of the Discrete-Time Fourier Transform (DTFT).

DFT converts a finite sequence of equally-spaced samples of a function into a same-length sequence of equally-spaced samples of the discrete-time Fourier transform (DTFT), which is a complex-valued function of frequency.

Discrete Cosine Transform (DCT)

  • The DCT is closely related to the DFT. While the DFT uses both sines and cosines (complex exponentials), the DCT uses only cosine functions. This makes the DCT real-valued and computationally simpler, and it is often preferred for data with (or treated as having) even symmetry, as in audio and image compression.

DCT expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies.

Fast Fourier Transform (FFT)

  • The DFT is powerful, but calculating it directly can be computationally expensive for large datasets. This is where the Fast Fourier Transform (FFT) comes in. It's a highly optimized algorithm specifically designed to compute the DFT efficiently, especially when the data length is a power of 2 (like 16, 32, 64 etc.).
  • The FFT is not a separate theoretical transform. It is just a fast algorithm for computing the DFT, which is especially efficient when N=2^k.

FFT is an algorithm that computes the Discrete Fourier Transform (DFT) of a sequence, or its inverse (IDFT). IDFT is a Fourier series, using the DTFT samples as coefficients of complex sinusoids at the corresponding DTFT frequencies.

 

The first step is to move from simple synthesis to complex synthesis.

 

What's wrong with the DCT?

The properties of sinusoids (sine waves): frequency, phase, and amplitude

We need a way to store phase and amplitude in the same place. That is called a complex number.

  • A complex number is basically two numbers stuck together.
    • Think of them as a way of storing phase and amplitude in one place, with associated mathematics to work with the complex numbers similarly to how we work with 'simple' numbers.
      → Complex numbers have a real and an imaginary part (two numbers).

Complex numbers in Python:
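For reference, a minimal sketch (not from the original notes) of Python's built-in complex number support:

import cmath

z = 2 + 3j             # complex literal: real part 2.0, imaginary part 3.0
print(z.real, z.imag)  # 2.0 3.0
print(abs(z))          # magnitude (amplitude): about 3.606
print(cmath.phase(z))  # phase angle in radians: about 0.983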

 

We need a way to compute waveforms from complex numbers: the exponential function

  • numpy.exp(x) is the exponential function, or
  • e^x (where e is Euler's number: approximately 2.718281)

→ numpy.exp(1j * x) converts x into a complex number then does exp!
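A quick check of Euler's formula with numpy (an illustrative sketch, not from the original notes): np.exp(1j * x) returns cos(x) + 1j*sin(x).

import numpy as np

x = np.linspace(0, 2 * np.pi, 8)
z = np.exp(1j * x)                     # Euler's formula: cos(x) + 1j*sin(x)
print(np.allclose(z.real, np.cos(x)))  # True
print(np.allclose(z.imag, np.sin(x)))  # True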

Simple synthesis:

def synthesise_simple(amps, fs, ts):
    args = np.outer(ts, fs)     # one row per sample time, one column per frequency
    M = np.cos(np.pi*2 * args)  # matrix of cosine components
    ys = np.dot(M, amps)        # weighted sum of the components
    return ys

Complex synthesis:

def synthesise_complex(amps, fs, ts):
    args = np.outer(ts, fs)
    M = np.exp(1j * np.pi*2 * args) # Swap out from np.cos(np.pi*2 * args)
    ys = np.dot(M, amps)
    return ys
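A quick sanity check (illustrative values, assuming the two functions above are in scope): with real amplitudes, the real part of the complex synthesis matches the cosine-only synthesis.

import numpy as np

fs = np.array([100, 200, 300])     # hypothetical component frequencies (Hz)
amps = np.array([0.6, 0.25, 0.1])  # real amplitudes
ts = np.arange(0, 0.005, 1/11025)  # a few milliseconds of sample times

ys_simple = synthesise_simple(amps, fs, ts)
ys_complex = synthesise_complex(amps, fs, ts)
print(np.allclose(ys_complex.real, ys_simple))  # True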

 

The complex analysis problem:

  • Given a signal and a set of frequencies, how can we find the amplitude and phase of each frequency component?

Simple analysis with numpy.linalg.solve and the DCT:

def analyse_simple(ys, fs, ts):
    args = np.outer(ts, fs)
    M = np.cos(np.pi*2 * args)
    amps = np.linalg.solve(M, ys)
    return amps
    
# More constrained version
def dct_iv(ys):
    N = len(ys)
    ts = (0.5 + np.arange(N)) / N
    fs = (0.5 + np.arange(N)) / 2
    args = np.outer(ts, fs)
    M = np.cos(np.pi*2 * args)
    amps = np.dot(M, ys) / (N / 2)
    return amps
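A round-trip sketch (assuming dct_iv and synthesise_simple above are in scope): analysing a signal with dct_iv and re-synthesising it with the same ts and fs should recover the original samples, because the DCT-IV matrix is orthogonal up to a factor of N/2.

import numpy as np

N = 8
ys = np.random.default_rng(0).normal(size=N)  # arbitrary test signal

amps = dct_iv(ys)
ts = (0.5 + np.arange(N)) / N
fs = (0.5 + np.arange(N)) / 2
print(np.allclose(synthesise_simple(amps, fs, ts), ys))  # True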

Complex analysis with np.linalg.solve:

def analyse_complex(ys, fs, ts):
    args = np.outer(ts, fs)
    M = np.exp(1j * np.pi*2 * args)
    amps = np.linalg.solve(M, ys)
    return amps

def analyse_nearly_dft(ys, fs, ts):
    N = len(fs)
    args = np.outer(ts, fs)
    M = np.exp(1j * np.pi*2 * args)
    amps = M.conj().transpose().dot(ys) / N # Swap out from 'amps = np.linalg.solve(M, ys)'
    return amps

Final steps for the actual DFT:

# Calculate the frequency and time matrix
def synthesis_matrix(N):
    ts = np.arange(N) / N
    fs = np.arange(N)
    args = np.outer(ts, fs)
    M = np.exp(1j * np.pi*2 * args)
    return M
    
# Transform
def dft(ys):
    N = len(ys)
    M = synthesis_matrix(N)
    amps = M.conj().transpose().dot(ys) # No more '/ N'!
    return amps
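With this convention the result should agree with numpy's built-in FFT; a quick check (assuming the dft function above is in scope):

import numpy as np

ys = np.array([0.5, 0.25, -0.3, 0.1, 0.9, -0.6, 0.2, 0.0])
print(np.allclose(dft(ys), np.fft.fft(ys)))  # True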

 

Fast convolution with the DFT

  • Convolving signals in the time domain is equivalent to multiplying their Fourier transforms in the frequency domain, as sketched below.
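A minimal sketch of the idea, using numpy's FFT routines rather than the dft above: multiplying DFTs corresponds to circular convolution, so zero-padding both signals to the full output length makes the result match ordinary (linear) convolution.

import numpy as np

x = np.array([1.0, 0.75, 0.5, 0.75, 1.0])
h = np.array([0.0, 1.0, 0.75, 0.5, 0.25])

L = len(x) + len(h) - 1         # length of the linear convolution
X = np.fft.fft(x, n=L)          # zero-pad to length L, then transform
H = np.fft.fft(h, n=L)
fast = np.fft.ifft(X * H).real  # multiply in the frequency domain, invert

print(np.allclose(fast, np.convolve(x, h)))  # True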

The inverse DFT

def idft(ys):
    N = len(ys)
    M = synthesis_matrix(N)
    amps = M.dot(ys) / N
    return amps
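Round-trip check (assuming the dft and idft functions above are in scope): forward then inverse reproduces the input, and idft agrees with numpy's inverse FFT.

import numpy as np

ys = np.random.default_rng(1).normal(size=16)
print(np.allclose(idft(dft(ys)), ys))          # True (imaginary parts are ~0)
print(np.allclose(idft(ys), np.fft.ifft(ys)))  # True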

 


Models that assign probabilities to upcoming words, or sequences of words in general, are called language models (LMs).

  • It turns out that the large language models that revolutionised modern NLP are trained just by predicting upcoming words.

Why does it matter what the probability of a sentence is or how probable the next word is?

  • In many NLP applications, we can use the probability as a way to choose a better sentence or word over a less-appropriate one.

Language models can also help in augmentative and alternative communication (AAC) systems.

  • People often use such AAC devices if they are physically unable to speak or sign but can instead use eye gaze or other specific movements to select words from a menu.
  • Word prediction can be used to suggest likely words for the menu.

N-gram language model

  • An n-gram is a sequence of n words:
    • A 2-gram (bigram) is a two-word sequence of words like "please turn", "turn your", or "your homework".
    • A 3-gram (trigram) is a three word sequence of words like "please turn your", or "turn your homework".
  • The word 'n-gram' is also used to mean a probabilistic model that can estimate the probability of a word given the n-1 previous words, and thereby also to assign probabilities to entire sequences.

Suppose the history h is "its water is so transparent that" and we want to know the probability that the next word is "the":

P(the | its water is so transparent that)

One way to estimate this probability is from relative frequency counts:

  • Take a very large corpus, count the number of times we see "its water is so transparent that", and count the number of times this is followed by "the".

That is, out of the times we saw the history h, how many times was it followed by the word w:

P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)

 

To represent the probability of a particular random variable

$$ X_i $$

taking on the value "the" or

$$ P(X_i="the") $$

we will use the simplification

$$ P(the) $$

To compute probabilities of entire sequences like

$$ P(X_1,X_2,...,X_n) $$

One thing we can do is decompose this probability using the chain rule of probability:

$$ P(X_1...X_n)=P(X_1)P(X_2|X_1)P(X_3|X_{1:2})...P(X_n|X_{1:n-1}) $$

$$ = \prod_{k=1}^{n}P(X_k|X_{1:k-1}) $$

Applying the chain rule to words, we get:

$$ P(w_{1:n})=P(w_1)P(w_2|w_1)P(w_3|w_{1:2})...P(w_n|w_{1:n-1}) $$

$$ =\prod_{k=1}^{n}P(w_k|w_{1:k-1}) $$

The chain rule shows the link between computing the joint probability of a sequence and computing the conditional probability of a word given previous words.

  • We could estimate the joint probability of an entire sequence of words by multiplying together a number of conditional probabilities.

We can't just estimate by counting the number of times every word occurs following every long string, because language is creative and any particular context might have never occurred before.

 

The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words.

  • For example, the bigram model approximates the probability of a word given all the previous words by using only the conditional probability of the preceding word.
    • Instead of computing the probability P(the | Walden Pond's water is so transparent that),
      we approximate it with the probability P(the | that)
  • When we use a bigram model to predict the conditional probability of the next word, we are thus making the following approximation:

$$ P(w_n|w_{1:n-1})\approx P(w_n|w_{n-1}) $$

 

The assumption that the probability of a word depends only on the previous word is called a Markov assumption.

  • Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.

A general equation for an n-gram approximation to the conditional probability of the next word in a sequence:

$$ P(w_n|w_{1:n-1})\approx P(w_n|w_{n-N+1:n-1}) $$

※ N means the n-gram size (N=2 means bigrams and N=3 means trigrams)

 

An intuitive way to estimate probabilities is called maximum likelihood estimation (MLE).

  • We get the MLE estimate for the parameters of an n-gram model by getting counts from a corpus, and normalising the counts so that they lie between 0 and 1.

To compute a particular bigram probability of a word

$$ w_n $$

given a previous word

$$ w_{n-1} $$

we'll compute the count of the bigram

$$ C(w_{n-1}w_n) $$

and normalise by the sum of all the bigrams that share the same first word

$$ w_{n-1} $$

The computation is:

$$ P(w_n|w_{n-1})=\frac{C(w_{n-1}w_n)}{\sum _wC(w_{n-1}w)} $$

The sum of all bigram counts that start with a given word

$$ w_{n-1} $$

must be equal to the unigram count for that word, so the simplified equation is as follows:

$$ P(w_n|w_{n-1})=\frac{C(w_{n-1}w_n)}{C(w_{n-1})} $$
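A minimal sketch of this MLE estimate on a toy corpus (the corpus and the <s>/</s> sentence markers below are illustrative, not from the notes):

from collections import Counter

corpus = [
    ["<s>", "i", "am", "sam", "</s>"],
    ["<s>", "sam", "i", "am", "</s>"],
    ["<s>", "i", "do", "not", "like", "green", "eggs", "and", "ham", "</s>"],
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)

def bigram_prob(w_prev, w):
    # P(w | w_prev) = C(w_prev w) / C(w_prev)
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(bigram_prob("i", "am"))   # 2/3
print(bigram_prob("<s>", "i"))  # 2/3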

 

Language model probabilities are always represented and computed in log format as log probabilities.

  • Since probabilities are less than or equal to 1, the more probabilities we multiply together, the smaller the product becomes.

By using log probabilities instead of raw probabilities, we get numbers that are not as small.

  • Adding in log space is equivalent to multiplying in linear space, so we combine log probabilities by adding them.

$$ P_1\times P_2\times P_3\times P_4=\exp(\log P_1+\log P_2+\log P_3+\log P_4) $$

The result of doing all computation and storage in log space is that we only need to convert back into probabilities if we need to report them at the end; then we can just take the exp of the logprob!
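A small numeric illustration (made-up values, not from the notes): multiplying many probabilities underflows to zero, while adding their logs stays comfortably within floating-point range.

import numpy as np

probs = np.full(1000, 0.01)   # 1000 word probabilities of 0.01 each

print(np.prod(probs))         # 0.0 (underflow: 1e-2000 is below the float range)
print(np.sum(np.log(probs)))  # about -4605.17, the log of the same product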

 

Evaluating language models

  • Extrinsic evaluation is the end-to-end evaluation (often very expensive!)
  • Intrinsic evaluation measures the quality of a model independent of any application
    • Perplexity (sometimes abbreviated as PP or PPL): the standard intrinsic metric for measuring language model performance, both for simple n-gram language models and for the more sophisticated neural large language models.

In order to evaluate any machine learning model, at least three distinct data sets are needed:

  1. Training set is used to learn the parameters of the model
  2. Development test set (or devset) is used to see how good the model is by testing on it during development (e.g. while tuning)
  3. Test set is used to evaluate the model → held-out set of data, not overlapping with the training set

Given two probabilistic models, the better model is the one that has a tighter fit to the test data or that better predicts the details of the test data, and hence will assign a higher probability to the test data.


The perplexity of a language model on a test set is the inverse probability of the test set (one over the probability of the test set), normalised by the number of words. For this reason, it's sometimes called the per-word perplexity.

For a test set

$$ W=w_1w_2...w_N $$

$$ perplexity(W)=P(w_1w_2...w_N)^{-\frac{1}{N}} $$

$$ =\sqrt[N]{\frac{1}{P(w_1w_2...w_N)}} $$

Or we can use the chain rule to expand the probability of W:

$$ \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i|w_1...w_{i-1})}} $$

  • Because of the inverse, the higher the probability of the word sequence, the lower the perplexity.
    → Thus the lower the perplexity of a model on the data, the better the model, and minimising perplexity is equivalent to maximising the test set probability according to the language model.
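As a concrete sketch of the formula (using made-up conditional probabilities and the log-space trick from above):

import numpy as np

word_probs = np.array([0.2, 0.1, 0.05, 0.3])  # P(w_i | w_1...w_{i-1}) for each test word
N = len(word_probs)

log_prob = np.sum(np.log(word_probs))  # log P(w_1...w_N) via the chain rule
perplexity = np.exp(-log_prob / N)     # P(w_1...w_N)^(-1/N)
print(perplexity)                      # about 7.6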

Perplexity can also be thought of as the weighted average branching factor of a language.

  • The branching factor of a language is the number of possible next words that can follow any word.

If we have an artificial deterministic language of integer numbers whose vocabulary consists of the 10 digits, in which any digit can follow any other digit, then the branching factor of that language is 10.

 

Suppose that each of the 10 digits occurs with exactly equal probability:

$$ P=\frac{1}{10} $$

Imagine a test string of digits of length N, and again, assume that in the training set all the digits occurred with equal probability. The perplexity will be

$$ perplexity(W)=P(w_1w_2...w_N)^{-\frac{1}{N}}=\left(\left(\frac{1}{10}\right)^N\right)^{-\frac{1}{N}}=\left(\frac{1}{10}\right)^{-1}=10 $$


Fairness in game playing AI: six key dimensions

  1. Perceptual fairness
    : Do both competitors perceive the game environment in the same way? This refers to the information they receive about the game (the same input space).
  2. Motoric fairness
    : Do both competitors have the same capabilities to take actions within the game (the same output space)? This includes limitations or advantages in movement, available options, or control schemes.
  3. Historic fairness
    : Did both AI systems have the same amount of time and data for training? This ensures a level playing field by avoiding an advantage for systems with more extensive training data.
  4. Knowledge fairness
    : Do both competitors have access to the same in-game knowledge? This refers to understanding the game's rules, objectives, and potentially strategies if applicable.
  5. Computational fairness
    : Do both AI systems have the same processing power for decision-making? This ensures neither system has a significant advantage in terms of computational speed or resources.
  6. Common-sense fairness
    : Do both AIs have access to the same background knowledge beyond the specific game? This includes common-sense reasoning that could influence gameplay decisions.

Isaac Asimov's three laws of robotics:

  1. The First Law
    : A robot may not injure a human being or, through inaction, allow a human being to come to harm.
    → This law prioritises human safety above all else.
  2. The Second Law
    : A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
    → Robots are programmed to follow human instructions, but not at the expense of harming humans.
  3. The Third Law
    : A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
    → Robots are given a basic instinct for self-preservation, but overridden by the higher priorities of protecting humans and following orders.

 

Signal averaging is a signal processing technique that tries to remove unwanted random disturbances from a signal through the process of averaging.

  • Averaging often takes the form of summing a series of signal samples and then dividing that sum by the number of individual samples.

The following equation represents an N-point moving average filter, with input array x and output (averaged) array y:

$$ y(n)=\frac{1}{N}\sum_{k=0}^{N-1}x(n-k) $$

 

Implementing in Python:

### 1. Simple example
import numpy as np

values = np.array([3., 9., 3., 4., 5., 2., 1., 7., 9., 1., 3., 5., 4., 9., 0., 4., 2., 8., 9., 7.])
N = 3

averages = np.empty(len(values))
for i in range(1, len(values)-1):
    averages[i] = (values[i-1]+values[i]+values[i+1])/N

# Preserve the edge values
averages[0] = values[0]
averages[len(values)-1] = values[len(values)-1]
### 2. Use numpy.convolve
window = np.ones(3)
window /= sum(window)
averages = np.convolve(values, window, mode='same')
### 3. Use scipy.ndimage.uniform_filter1d
from scipy.ndimage import uniform_filter1d
averages = uniform_filter1d(values, size=3)

 

Averaging low-pass filter

In signal processing, the moving average filter can be used as a simple low-pass filter. The moving average filter smooths out a signal, removing the high frequency components from it, and this is what a low-pass filter does!

 

FIR (Finite Impulse Response) filters

In signal processing, an FIR filter is a filter whose impulse response (or response to any finite length input) is of finite duration, because it settles to zero in finite time. For a general N-tap FIR filter, the nth output is:

$$  y(n)=\sum_{k=0}^{N-1}h(k)x(n-k) $$

$$ h(n)=\frac{1}{N} $$

$$ n=0,1,...,N-1 $$

This formula has already been used above, since the moving average filter is a kind of FIR filter.

 

Implementing in Python:

import numpy as np
from thinkdsp import SquareSignal, Wave

# suppress scientific notation for small numbers
np.set_printoptions(precision=3, suppress=True)

# The wave to be filtered
from thinkdsp import read_wave
my_sound = read_wave('../Audio/429671__violinsimma__violin-carnatic-phrase-am.wav')
my_sound.make_audio()

# Make a 5-tap FIR filter using the following coefficients: 0.1, 0.2, 0.2, 0.2, 0.1
window = np.array([0.1, 0.2, 0.2, 0.2, 0.1])

# Apply the window to the signal using np.convolve
filtered = np.convolve(my_sound.ys, window, mode='same')
filtered_violin = Wave(filtered, framerate=my_sound.framerate)
filtered_violin.make_audio()

 

LTI (Linear Time Invariant) systems

  • If it happens to be an LTI system, we can represent its behaviour as a list of numbers known as an IMPULSE RESPONSE.

An impulse response is the response of an LTI system to the impulse signal.

  • An impulse is one single maximum amplitude sample.

Example of an impulse:

There is one single stalk reaching up to the maximum amplitude, with all other samples at 0.

Example of an impulse response:

It is a bunch of stalks (a set of numbers).

Given an impulse response, we can easily process any signal with that system using convolution.

 

  • We can derive the output of a discrete linear system by adding together the system's response to each input sample separately. This operation is known as convolution.

$$ y[n]=x[n]*h[n]=\sum_{m=0}^{\infty }x[m]h[n-m] $$

※ The convolution operator is indicated by the '*' operator

 

Three characteristics of LTI systems

Linear systems have very specific characteristics which enable us to do the convolution:

  1. Homogeneity (or linear with respect to scale)
    : Multiply the signal by 0.5 (scale it by 0.5), shove both signals through the system, and get the outputs
    1) Convolve the signal with the system
    2) Receive the output
    → It doesn't matter if the signal is scaled, because we know that it will produce the same scaled output.
  2. Additivity (decompose)
    : Separately process simple signals and add results together
  3. Shift invariance
    : Shift a signal across (e.g. delay by one unit)

Applying convolution with an impulse response by hand:

  • Signal = [1.0, 0.75, 0.5, 0.75, 1.0]
  • System = [0.0, 1.0, 0.75, 0.5, 0.25]
    • Decompose:
      • input = [0.0, 0.0, 0.0, 0.0, 0.0]
      • input = [0.0, 1.0, 0.0, 0.0, 0.0]
      • input = [0.0, 0.0, 0.75, 0.0, 0.0]
      • input = [0.0, 0.0, 0.0, 0.5, 0.0]
      • input = [0.0, 0.0, 0.0, 0.0, 0.25]
    • Scale:
      • output = [0.0, 0.0, 0.0, 0.0, 0.0]
      • output = [1.0, 0.75, 0.5, 0.75, 1.0]
      • output = [0.75, 0.5625, 0.375, 0.5625, 0.75]
      • output = [0.5, 0.375, 0.25, 0.375, 0.5]
      • output = [0.25, 0.1875, 0.125, 0.1875, 0.25]
    • Shift:
      • output = [0.0, 0.0, 0.0, 0.0, 0.0]
      • output = [0.0, 1.0, 0.75, 0.5, 0.75, 1.0] // delay by one unit
      • output = [0.0, 0.0, 0.75, 0.5625, 0.375, 0.5625, 0.75] // delay by two units
      • output = [0.0, 0.0, 0.0, 0.5, 0.375, 0.25, 0.375, 0.5] // delay by three units
      • output = [0.0, 0.0, 0.0, 0.0, 0.25, 0.1875, 0.125, 0.1875, 0.25] // delay by four units
    • Synthesise (add the components back together):
      • output (result) = [0.0, 1.0, 1.5, 1.5625, 1.75, 2.0, 1.25, 0.6875, 0.25]

Implementing in Python:

import numpy as np

def convolve(signal, system):
    # Output length of a linear convolution
    rst = np.zeros(len(signal) + len(system) - 1)
    for sig_idx in range(len(signal)):
        sig_val = signal[sig_idx]
        for sys_idx in range(len(system)):
            sys_val = system[sys_idx]
            scaled = sig_val * sys_val   # scale
            out_idx = sig_idx + sys_idx  # shift
            rst[out_idx] += scaled       # synthesise (add)
    return rst
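A quick check (assuming the convolve function and the numpy import above): the output matches both the hand-worked example and numpy's own convolution.

signal = [1.0, 0.75, 0.5, 0.75, 1.0]
system = [0.0, 1.0, 0.75, 0.5, 0.25]

print(convolve(signal, system))
# [0.  1.  1.5  1.5625  1.75  2.  1.25  0.6875  0.25]
print(np.allclose(convolve(signal, system), np.convolve(signal, system)))  # True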

 

