Fairness in game playing AI: six key dimensions

  1. Perceptual fairness
    : Do both competitors perceive the game environment in the same way? This refers to the information they receive about the game (the same input space).
  2. Motoric fairness
    : Do both competitors have the same capabilities to take actions within the game (the same output space)? This includes limitations or advantages in movement, available options, or control schemes.
  3. Historic fairness
    : Do both AI systems have the same amount of time and data for training? This ensures a level playing field by avoiding an advantage for systems with more extensive training data.
  4. Knowledge fairness
    : Do both competitors have access to the same in-game knowledge? This refers to understanding the game's rules, objectives, and potentially strategies if applicable.
  5. Computational fairness
    : Do both AI systems have the same processing power for decision-making? This ensures neither system has a significant advantage in terms of computational speed or resources.
  6. Common-sense fairness
    : Do both AIs have access to the same background knowledge beyond the specific game? This includes common-sense reasoning that could influence gameplay decisions.

Isaac Asimov's three laws of robotics:

  1. The First Law
    : A robot may not injure a human being or, through inaction, allow a human being to come to harm.
    → This law prioritises human safety above all else.
  2. The Second Law
    : A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
    → Robots are programmed to follow human instructions, but not at the expense of harming humans.
  3. The Third Law
    : A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
    → Robots are given a basic instinct for self-preservation, but it is overridden by the higher priorities of protecting humans and following orders.

 


Signal averaging is a signal processing technique that tries to remove unwanted random disturbances from a signal through the process of averaging.

  • Averaging often takes the form of summing a series of signal samples and then dividing that sum by the number of individual samples.

The following equation represents an N-point moving average filter, with input array x and output (averaged) array y:

$$y(n) = \frac{1}{N}\sum_{k=0}^{N-1} x(n-k)$$

 

Implementing in Python:

### 1. Simple example
import numpy as np

values = np.array([3., 9., 3., 4., 5., 2., 1., 7., 9., 1., 3., 5., 4., 9., 0., 4., 2., 8., 9., 7.])
N = 3

# Centred 3-point moving average for the interior samples
averages = np.empty(len(values))
for i in range(1, len(values)-1):
    averages[i] = (values[i-1]+values[i]+values[i+1])/N

# Preserve the edge values
averages[0] = values[0]
averages[len(values)-1] = values[len(values)-1]
### 2. Use numpy.convolve
window = np.ones(3)
window /= sum(window)
averages = np.convolve(values, window, mode='same')
### 3. Use scipy.ndimage.uniform_filter1d
from scipy.ndimage import uniform_filter1d
averages = uniform_filter1d(values, size=3)
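
The three approaches handle the edges differently: the loop in example 1 keeps the original edge values, np.convolve with mode='same' effectively zero-pads the signal, and uniform_filter1d reflects the signal at the boundary by default. On the interior samples they agree. A quick check of that claim (a sketch reusing the values array from example 1):

import numpy as np
from scipy.ndimage import uniform_filter1d

conv = np.convolve(values, np.ones(3) / 3, mode='same')
filt = uniform_filter1d(values, size=3)
print(np.allclose(conv[1:-1], filt[1:-1]))  # True: only the boundary handling differs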

 

Averaging low-pass filter

In signal processing, the moving average filter can be used as a simple low-pass filter. The moving average filter smooths out a signal, removing the high-frequency components from it, and this is what a low-pass filter does!

 

FIR (Finite Impulse Response) filters

In signal processing, an FIR filter is a filter whose impulse response (or response to any finite-length input) is of finite duration, because it settles to zero in finite time. For a general N-tap FIR filter, the nth output is:

$$y(n) = \sum_{k=0}^{N-1} h(k)\,x(n-k)$$

For the moving average filter, the coefficients are

$$h(n) = \frac{1}{N}, \qquad n = 0, 1, \ldots, N-1$$

This formula has already been used above, since the moving average filter is a kind of FIR filter.

 

Implementing in Python:

import numpy as np
from thinkdsp import Wave, read_wave

# Suppress scientific notation for small numbers
np.set_printoptions(precision=3, suppress=True)

# The wave to be filtered
my_sound = read_wave('../Audio/429671__violinsimma__violin-carnatic-phrase-am.wav')
my_sound.make_audio()

# Make a 5-tap FIR filter using the following coefficients: 0.1, 0.2, 0.2, 0.2, 0.1
window = np.array([0.1, 0.2, 0.2, 0.2, 0.1])

# Apply the window to the signal using np.convolve
filtered = np.convolve(my_sound.ys, window, mode='same')
filtered_violin = Wave(filtered, framerate=my_sound.framerate)
filtered_violin.make_audio()
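
An alternative worth noting (a sketch with a synthetic test signal, so it does not depend on thinkdsp): scipy.signal.lfilter applies the same taps causally, exactly as in the formula y(n) = Σ h(k) x(n−k), whereas np.convolve with mode='same' centres the window, so for a 5-tap filter the causal output is the centred output delayed by two samples.

import numpy as np
from scipy.signal import lfilter

taps = np.array([0.1, 0.2, 0.2, 0.2, 0.1])
x = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 100))   # synthetic 5 Hz test tone

causal = lfilter(taps, [1.0], x)             # y(n) = sum_k h(k) x(n-k)
centred = np.convolve(x, taps, mode='same')  # window centred on each sample

# The causal output is the centred output delayed by two samples
print(np.allclose(causal[2:], centred[:-2]))  # True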

 

LTI (Linear Time Invariant) systems

  • If a system happens to be an LTI system, we can represent its behaviour as a list of numbers known as an IMPULSE RESPONSE.

An impulse response is the response of an LTI system to the impulse signal.

  • An impulse is one single maximum amplitude sample.

Example of an impulse:

In a stem plot, it appears as a single stalk: one sample at the maximum amplitude, with every other sample at zero.

Example of an impulse response:

In a stem plot, it is a bunch of stalks (a set of numbers) describing how the system responds over time.

Given an impulse response, we can easily process any signal with that system using convolution.

 

  • We can derive the output of a discrete linear system by adding together the system's response to each input sample separately. This operation is known as convolution.

$$y[n] = x[n] * h[n] = \sum_{m=0}^{\infty} x[m]\,h[n-m]$$

※ The convolution operation is indicated by the '*' operator

 

Three characteristics of LTI systems

Linear systems have very specific characteristics which enable us to do the convolution (a short check in code follows this list):

  1. Homogeneity (or linearity with respect to scale)
    : Scale the signal (e.g. multiply it by 0.5), pass both the original and the scaled signal through the system, and compare the outputs
    1) Convolve the signal with the system
    2) Receive the output
    → It doesn't matter if the signal is scaled, because we know that it will produce the same scaled output.
  2. Additivity (decompose)
    : Separately process simple signals and add results together
  3. Shift invariance
    : Shift a signal across (e.g. delay by one unit)
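
A short check of these three properties in code (a sketch; the signal and system values are taken from the hand-worked example below, with np.convolve playing the role of the system):

import numpy as np

x1 = np.array([1.0, 0.75, 0.5, 0.75, 1.0])
x2 = np.array([0.5, -0.25, 0.0, 0.25, 0.5])   # a second, illustrative signal
h = np.array([0.0, 1.0, 0.75, 0.5, 0.25])

# 1. Homogeneity: scaling the input scales the output by the same factor
print(np.allclose(np.convolve(0.5 * x1, h), 0.5 * np.convolve(x1, h)))   # True

# 2. Additivity: the response to a sum is the sum of the responses
print(np.allclose(np.convolve(x1 + x2, h), np.convolve(x1, h) + np.convolve(x2, h)))  # True

# 3. Shift invariance: delaying the input delays the output by the same amount
delayed = np.concatenate(([0.0], x1))          # delay x1 by one sample
print(np.allclose(np.convolve(delayed, h)[1:], np.convolve(x1, h)))      # True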

Applying an impulse response by hand (decompose, scale, shift, synthesise):

  • Signal = [1.0, 0.75, 0.5, 0.75, 1.0]
  • System = [0.0, 1.0, 0.75, 0.5, 0.25]
    • Decompose:
      • input = [0.0, 0.0, 0.0, 0.0, 0.0]
      • input = [0.0, 1.0, 0.0, 0.0, 0.0]
      • input = [0.0, 0.0, 0.75, 0.0, 0.0]
      • input = [0.0, 0.0, 0.0, 0.5, 0.0]
      • input = [0.0, 0.0, 0.0, 0.0, 0.25]
    • Scale:
      • output = [0.0, 0.0, 0.0, 0.0, 0.0]
      • output = [1.0, 0.75, 0.5, 0.75, 1.0]
      • output = [0.75, 0.5625, 0.375, 0.5625, 0.75]
      • output = [0.5, 0.375, 0.25, 0.375, 0.5]
      • output = [0.25, 0.1875, 0.125, 0.1875, 0.25]
    • Shift:
      • output = [0.0, 0.0, 0.0, 0.0, 0.0]
      • output = [0.0, 1.0, 0.75, 0.5, 0.75, 1.0] // delay by one unit
      • output = [0.0, 0.0, 0.75, 0.5625, 0.375, 0.5625, 0.75] // delay by two units
      • output = [0.0, 0.0, 0.0, 0.5, 0.375, 0.25, 0.375, 0.5] // delay by three units
      • output = [0.0, 0.0, 0.0, 0.0, 0.25, 0.1875, 0.125, 0.1875, 0.25] // delay by four units
    • Synthesise (add the components back together):
      • output (result) = [0.0, 1.0, 1.5, 1.5625, 1.75, 2.0, 1.25, 0.6875, 0.25]

Implementing in Python:

import numpy as np

def convolve(signal, system):
    # Output length of a full convolution: len(signal) + len(system) - 1
    rst = np.zeros(len(signal) + len(system) - 1)
    for sig_idx in range(len(signal)):
        sigval = signal[sig_idx]
        for sys_idx in range(len(system)):
            sysval = system[sys_idx]
            scaled = sigval * sysval       # scale the signal sample by the system value
            out_idx = sig_idx + sys_idx    # shift by the system index
            rst[out_idx] += scaled         # synthesise: add the components together
    return rst
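
A quick sanity check (a sketch): the function reproduces the hand-worked result above and agrees with NumPy's built-in convolution.

signal = [1.0, 0.75, 0.5, 0.75, 1.0]
system = [0.0, 1.0, 0.75, 0.5, 0.25]

print(convolve(signal, system))   # matches the hand-worked output (result) above
print(np.allclose(convolve(signal, system), np.convolve(signal, system)))  # True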

 


A simple scalar example of audio processing:

  • Amplitude is on the y-axis.
  • Time is on the x-axis.

Normalisation in audio signals allows us to adjust the volume (amplitude) of the entire signal.

  • We can change the size of the amplitude in a proportionate way.

Normalisation in audio signals is a bit simpler than statistical normalisation. It involves two phases, analysis and scaling (a short sketch follows the list below).

  1. Analysis phase
    : In this phase, the signal is analysed to find the peak, or the loudest sample. This is essentially a peak-finding algorithm that identifies the highest amplitude in the waveform.
  2. Scaling phase
    : Once the peak is found, the algorithm calculates how much gain can be applied to the entire signal without causing clipping (distortion). This gain is then applied uniformly to the entire signal.
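
A minimal sketch of both phases in code (the function name and the target peak value are illustrative, not from the notes):

import numpy as np

def normalise(signal, target_peak=1.0):
    peak = np.max(np.abs(signal))   # analysis: find the loudest sample
    if peak == 0:
        return signal               # silent signal: nothing to scale
    gain = target_peak / peak       # scaling: the gain that avoids clipping
    return signal * gain            # apply the gain uniformly to the whole signal

samples = np.array([0.1, -0.4, 0.25, -0.05])
print(normalise(samples))           # the loudest sample now has magnitude 1.0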

Linear ramps: fading in and out

  • Fade in
    : The scalar starts at zero, which mutes the signal, and then gradually increases as we move through the range until it reaches one, at which point it makes no change, so the signal is effectively back at its original level.
  • Fade out
    : The scalar starts high at the beginning of the array of numbers we are processing and, as we iterate over the array, it is reduced down to zero, so the signal sounds like it is getting quieter. (See the sketch after this list.)
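
A minimal sketch of both fades (the function names and the ramp length n are illustrative):

import numpy as np

def fade_in(signal, n):
    ramp = np.linspace(0.0, 1.0, n)   # scalar goes from 0 (mute) up to 1 (unchanged)
    out = signal.copy()
    out[:n] *= ramp
    return out

def fade_out(signal, n):
    ramp = np.linspace(1.0, 0.0, n)   # scalar goes from 1 down to 0 (silence)
    out = signal.copy()
    out[-n:] *= ramp
    return out

x = np.ones(8)
print(fade_in(x, 4))    # first four samples ramp up from 0 to 1
print(fade_out(x, 4))   # last four samples ramp down from 1 to 0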

 


In the field of reinforcement learning, the action policy is a mapping between states and actions, denoted by the Greek letter 'π' (pi). This means that the policy, given state s, will recommend taking action a: π(s) → a.

  • State: s
  • Action: a
  • Next state: s'

A Markov Decision Process is a way of formalising a stochastic sequential decision problem.

  • State transitions: P(s' | s, a)
  • Reward function: R(s, a, s')

Formalising basically means expressing something in a clear and mathematical way, so that it can then be used to build algorithms.

  • Stochastic means it has a probabilistic element: the process might move to different next states probabilistically, based on the current state and the chosen action. A minimal sketch of this representation in code follows.
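
A minimal sketch of these two ingredients as plain Python dictionaries (the state and action names are illustrative, not from the notes):

# State transitions P(s' | s, a): for each (state, action) pair, a distribution over next states
P = {
    ('s1', 'a1'): {'s2': 0.8, 's3': 0.2},
    ('s1', 'a2'): {'s3': 1.0},
}

# Reward function R(s, a, s')
R = {
    ('s1', 'a1', 's2'): 1.0,
    ('s1', 'a1', 's3'): -1.0,
    ('s1', 'a2', 's3'): 0.5,
}

print(P[('s1', 'a1')]['s2'])   # probability of landing in s2 after taking a1 in s1: 0.8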

 

Unlike a one-time decision, an optimal policy in reinforcement learning considers the long-term reward. It aims to maximise the total reward accumulated over a sequence of actions, even if some rewards come much later.

 

Bellman equation(ish) defining the value of a given action in a given state based on future reward.

The value of action a in state s is the reward for s plus the maximum possible future reward from the states reached at later time steps t, increasingly discounted by the discount factor gamma (γ) raised to the power of t.

  • Gamma (γ) is less than 1.
    • Assuming γ = 0.5, γ squared would be 0.25, so we only take a quarter of a reward that arrives two steps ahead. Because we are not confident about what will happen in the future, we only take a small fraction of the rewards that come later.

Note that the probabilities of state transitions are not included here.
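
A sketch of this in standard notation, assuming deterministic transitions (consistent with the note above that transition probabilities are not included), where s' is the state reached after taking action a in state s:

$$Q(s, a) = r(s) + \gamma \max_{a'} Q(s', a')$$

Unrolling the recursion gives the discounted sum of future rewards, with a reward t steps ahead scaled by γ^t.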

 

What are the future states and rewards?

  • The state transition matrix describes how the environment reacts to the chosen actions (how the state will change over time based on the chosen actions). It tells us the probability of reaching different states after taking specific actions in the current state.
  • The action policy, on the other hand, guides the decision-making. It takes the current state as input and recommends which action to take. This recommendation can be based on maximising immediate reward, long-term reward, or other criteria depending on the specific policy.

In reinforcement learning, creating an optimal action policy often requires complete knowledge of the environment. This includes knowing the transition matrix (all possible state transitions based on actions) and the rewards associated with each transition.

 

However, in most real-world scenarios, this information is incomplete. Q-learning is a technique that addresses this challenge. It focuses on learning a Q-value function, which estimates the expected future reward for taking a specific action in a particular state.

  • The goal of Q-learning is to find the optimal Q-function, Q*, which tells us the best action to take in any given state to maximise future rewards.

There are various ways to do Q-learning, but many of them don't scale to real problems. The approach that works here is to approximate the Q-value function using a deep network, called a Deep Q Network (DQN).

 

DQN agent architecture

  • An agent is an entity that can observe and act autonomously.
    • We need an agent architecture that solves two problems: no state transition matrix and no action policy.

We explore the game and make observations of the form: s, a, s', r, and done.

  • s = state now
  • a = action taken
  • s' = next state
  • r = reward
  • done = true/false is the game finished?

For DQN, these observations are stored in the replay buffer, which the agent fills up over time. An example for one state, s1, and three actions, a1/a2/a3, is as follows (a minimal buffer sketch follows the list):

  1. s1, a1 → s2, r1, d0
  2. s1, a2 → s3, r2, d0
  3. s1, a3 → s4, r3, d1
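
A minimal sketch of a replay buffer (the class name, capacity and example values are illustrative):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest observations fall off the end

    def add(self, s, a, s_next, r, done):
        self.buffer.append((s, a, s_next, r, done))

    def sample(self, batch_size):
        # Uniform sampling over the stored observations (U(D) in the loss below)
        return random.sample(list(self.buffer), batch_size)

buffer = ReplayBuffer()
buffer.add('s1', 'a1', 's2', 1.0, False)
buffer.add('s1', 'a2', 's3', 0.5, False)
buffer.add('s1', 'a3', 's4', 2.0, True)
print(buffer.sample(2))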

Epsilon-greedy exploration is the method the agent uses to fill up the replay buffer. As an acting policy, it is a simple and effective way of balancing exploration and exploitation based on the estimated rewards (sketched below).

  • Epsilon-greedy works by introducing a probability (epsilon, ε) of taking a random action instead of the one with the highest estimated Q-value.
    → this encourages exploration and helps the agent discover potentially better actions it might not have encountered yet
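
A minimal sketch of epsilon-greedy action selection (the function name, the q_values argument and the epsilon value are illustrative):

import random
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon, explore: pick a random action
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # Otherwise exploit: pick the action with the highest estimated Q-value
    return int(np.argmax(q_values))

print(epsilon_greedy(np.array([0.2, 1.5, -0.3])))   # usually returns 1, occasionally a random index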

The DQN agent

  • Knows about the states and rewards
  • Acts in the game world by taking actions
  • Makes observations of what is happening in the game
  • Maintains a replay buffer consisting of many observations of the actions taken by the agent in the game world and the results of those actions

In Deep Q-Networks (DQN), a crucial part of the training process is the loss function. This function helps the network learn by measuring the difference between its predictions and the desired outcome.

  • Theta (θ) denotes the weights of the network.
  • θ⁻ denotes the weights of an older copy of the network (the target network).

To train the network, we use a technique called experience replay. We store past experiences (state, action, reward, next state) in a replay buffer (D). During training, we uniformly sample a mini-batch of these experiences, denoted by U(D), to create a training set. This training set feeds the network and helps it learn from various past experiences.
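
A sketch of the loss being described, in the standard DQN form: the squared temporal-difference error between the current network's prediction Q(s, a; θ) and a target computed with the older weights θ⁻, averaged over a mini-batch sampled uniformly from the replay buffer D:

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim U(D)}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\big)^{2}\Big]$$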
