Perceptual fairness : Do both competitors perceive the game environment in the same way? This refers to the information they receive about the game (the same input space).
Motoric fairness : Do both competitors have the same capabilities to take actions within the game (the same output space)? This includes limitations or advantages in movement, available options, or control schemes.
Historic fairness : Do both AI systems have the same amount of time and data for training? This ensures a level playing field by avoiding an advantage for systems with more extensive training data.
Knowledge fairness : Do both competitors have access to the same in-game knowledge? This refers to understanding the game's rules, objectives, and potentially strategies if applicable.
Computational fairness : Do both AI systems have the same processing power for decision-making? This ensures neither system has a significant advantage in terms of computational speed or resources.
Common-sense fairness : Do both AIs have access to the same background knowledge beyond the specific game? This includes common-sense reasoning that could influence gameplay decisions.
Isaac Asimov's three laws of robotics:
The First Law : A robot may not injure a human being or, through inaction, allow a human being to come to harm. → This law prioritises human safety above all else.
The Second Law : A robot must obey the orders given it by human beings except where such orders would conflict with the First Law. → Robots are programmed to follow human instructions, but not at the expense of harming humans.
The Third Law : A robot must protect its own existence as long as such protection does not conflict with the First or Second Law. → Robots are given a basic instinct for self-preservation, but it is overridden by the higher priorities of protecting humans and following orders.
In signal processing, the moving average filter can be used as a simple low-pass filter. The moving average filter smooths out a signal, removing the high-frequency components from it, and this is exactly what a low-pass filter does!
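As a quick illustration (a minimal numpy sketch added here, not part of the notes' own code), a moving average can be implemented by convolving the signal with a window of equal weights that sum to one; the averaging smooths away the fast, high-frequency fluctuations while keeping the slow trend:
import numpy as np
# A noisy signal: a slow sine wave plus fast random fluctuations
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)
noisy = np.sin(2 * np.pi * 3 * t) + 0.3 * rng.standard_normal(t.size)
# 5-point moving average: equal weights that sum to 1
window = np.ones(5) / 5
smoothed = np.convolve(noisy, window, mode='same')
# The smoothed signal keeps the slow sine but loses most of the fast noise,
# which is exactly the behaviour of a low-pass filter.
print(noisy[:5])
print(smoothed[:5])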
FIR (Finite Impulse Response) filters
In signal processing, an FIR filter is a filter whose impulse response (or response to any finite-length input) is of finite duration, because it settles to zero in finite time. For a general N-tap FIR filter with coefficients (taps) b₀, b₁, …, the nth output is:
y[n] = Σᵢ bᵢ · x[n−i], with the sum running from i = 0 to N−1.
This formula has already been used above, since the moving average filter is a kind of FIR filter.
Implementing in Python:
import numpy as np
from thinkdsp import SquareSignal, Wave
# suppress scientific notation for small numbers
np.set_printoptions(precision=3, suppress=True)
# The wave to be filtered
from thinkdsp import read_wave
my_sound = read_wave('../Audio/429671__violinsimma__violin-carnatic-phrase-am.wav')
my_sound.make_audio()
# Make a 5-tap FIR filter using the following coefficients: 0.1, 0.2, 0.2, 0.2, 0.1
window = np.array([0.1, 0.2, 0.2, 0.2, 0.1])
# Apply the window to the signal using np.convolve
filtered = np.convolve(my_sound.ys, window, mode='same')
filtered_violin = Wave(filtered, framerate=my_sound.framerate)
filtered_violin.make_audio()
LTI (Linear Time Invariant) systems
If a system happens to be an LTI system, we can represent its behaviour as a list of numbers known as an IMPULSE RESPONSE.
An impulse response is the response of an LTI system to the impulse signal.
An impulse is one single maximum amplitude sample.
Example of an impulse:
There is a single stalk (stem) reaching up to the maximum amplitude of 1.0, with every other sample at zero.
Example of an impulse response:
It is a bunch of stalks (a set of numbers).
Given an impulse response, we can easily process any signal with that system using convolution.
We can derive the output of a discrete linear system by adding together the system's response to each input sample separately. This operation is known as convolution.
※ The convolution operation is indicated by the '*' operator
Three characteristics of LTI systems
Linear systems have very specific characteristics which enable us to do the convolution:
Homogeneity (or linear with respect to scale)
: Multiply the signal by 0.5 (scale it by 0.5), pass both the original and the scaled signal through the system, and compare the outputs. 1) Convolve the signal with the system 2) Receive the output → It doesn't matter if the signal is scaled, because we know it will produce the same output, scaled by the same factor.
Additivity (decompose)
: Separately process simple signals and add results together
Shift invariance
: Shift a signal across (e.g. delay by one unit)
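The following is a minimal numpy sketch (an illustration added here, with made-up signals) that checks these three properties numerically using np.convolve:
import numpy as np
x = np.array([1.0, 0.75, 0.5, 0.75, 1.0])   # a signal
h = np.array([0.0, 1.0, 0.75, 0.5, 0.25])   # an impulse response (the system)
# Homogeneity: scaling the input scales the output by the same factor
print(np.allclose(np.convolve(0.5 * x, h), 0.5 * np.convolve(x, h)))   # True
# Additivity: processing two signals separately and summing the results gives
# the same answer as processing their sum
x2 = np.array([0.2, 0.0, -0.4, 0.1, 0.3])
print(np.allclose(np.convolve(x + x2, h),
                  np.convolve(x, h) + np.convolve(x2, h)))             # True
# Shift invariance: delaying the input by one sample delays the output by one
x_delayed = np.concatenate([[0.0], x])
y = np.convolve(x, h)
y_delayed = np.convolve(x_delayed, h)
print(np.allclose(y_delayed[1:], y))                                    # True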
Implement an impulse response by hand:
Signal = [1.0, 0.75, 0.5, 0.75, 1.0]
System = [0.0, 1.0, 0.75, 0.5, 0.25]
Decompose:
input = [0.0, 0.0, 0.0, 0.0, 0.0]
input = [0.0, 1.0, 0.0, 0.0, 0.0]
input = [0.0, 0.0, 0.75, 0.0, 0.0]
input = [0.0, 0.0, 0.0, 0.5, 0.0]
input = [0.0, 0.0, 0.0, 0.0, 0.25]
Scale:
output = [0.0, 0.0, 0.0, 0.0, 0.0]
output = [1.0, 0.75, 0.5, 0.75, 1.0]
output = [0.75, 0.5625, 0.375, 0.5625, 0.75]
output = [0.5, 0.375, 0.25, 0.375, 0.5]
output = [0.25, 0.1875, 0.125, 0.1875, 0.25]
Shift:
output = [0.0, 0.0, 0.0, 0.0, 0.0]
output = [0.0, 1.0, 0.75, 0.5, 0.75, 1.0] // delay by one unit
output = [0.0, 0.0, 0.75, 0.5625, 0.375, 0.5625, 0.75] // delay by two units
output = [0.0, 0.0, 0.0, 0.5, 0.375, 0.25, 0.375, 0.5] // delay by three units
output = [0.0, 0.0, 0.0, 0.0, 0.25, 0.1875, 0.125, 0.1875, 0.25] // delay by four units
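The final step (not written out above) is to add all of these shifted, scaled copies together; that sum is the convolution of the Signal with the System. Below is a small sketch (my addition) that reproduces the decompose / scale / shift / add steps and checks the result against np.convolve:
import numpy as np
signal = np.array([1.0, 0.75, 0.5, 0.75, 1.0])
system = np.array([0.0, 1.0, 0.75, 0.5, 0.25])
# Decompose the System into single-sample impulses, scale the Signal by each
# sample value, shift it by that sample's position, and add everything up.
output = np.zeros(len(signal) + len(system) - 1)
for delay, coeff in enumerate(system):
    scaled = coeff * signal                      # scale
    output[delay:delay + len(signal)] += scaled  # shift and add
print(output)
print(np.convolve(signal, system))  # same result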
Normalisation in audio signals allows us to adjust the volume (amplitude) of the entire signal.
We can change the size of the amplitude in a proportionate way.
Normalisation in audio signals is a bit simpler than statistical normalisation. It involves two phases: analysis and scaling.
Analysis phase : In this phase, the signal is analysed to find the peak, or the loudest sample. This is essentially a peak-finding algorithm that identifies the highest amplitude in the waveform.
Scaling phase : Once the peak is found, the algorithm calculates how much gain can be applied to the entire signal without causing clipping (distortion). This gain is then applied uniformly to the entire signal.
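A minimal sketch of peak normalisation in numpy (an illustration; the target peak of 1.0 is an assumption, and real implementations may work in decibels or leave headroom):
import numpy as np
def normalise(samples, target_peak=1.0):
    # Analysis phase: find the loudest sample (the peak), ignoring its sign
    peak = np.max(np.abs(samples))
    if peak == 0:
        return samples  # silent signal, nothing to scale
    # Scaling phase: compute the gain that brings the peak to the target level
    # and apply it uniformly to every sample
    gain = target_peak / peak
    return samples * gain
quiet = np.array([0.1, -0.25, 0.2, -0.05])
print(normalise(quiet))  # the peak is now 1.0 (at the sample that was -0.25)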
Linear ramps: fading in and out
Fade in : The scalar starts at zero, which mutes the signal. As we move through the fade range, the scalar gradually increases until it reaches one, at which point it makes no change to the signal, so the output effectively returns to the original signal.
Fade out : The scalar starts high (at one) at the beginning of the range of samples being processed. As we iterate over the samples, the scalar is reduced down to zero, which sounds like the signal gradually getting quieter.
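A minimal sketch (my addition) of a linear fade in and fade out using numpy, applying a ramp of scalars across the whole signal:
import numpy as np
def fade_in(samples):
    # Scalar ramps from 0 (silence) up to 1 (unchanged signal)
    ramp = np.linspace(0.0, 1.0, len(samples))
    return samples * ramp
def fade_out(samples):
    # Scalar ramps from 1 (unchanged signal) down to 0 (silence)
    ramp = np.linspace(1.0, 0.0, len(samples))
    return samples * ramp
tone = np.sin(2 * np.pi * 440 * np.arange(1000) / 44100)
print(fade_in(tone)[:3])    # starts silent
print(fade_out(tone)[-3:])  # ends silent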
In the field of reinforcement learning, the action policy is a mapping between states and actions, denoted by the Greek letter 'π' (pi). This means that the policy, given state s, will recommend taking action a: π(s) → a.
State: s
Action: a
Next state: s'
A Markov Decision Process is a way of formalising a stochastic sequential decision problem.
State transitions: P(s' | s, a)
Reward function: R(s, a, s')
Formalising is basically expressing something in a clear and mathematical way, so that it can then be used to build algorithms. Here, the thing being formalised is a stochastic sequential decision problem.
Stochastic means it has a probabilistic element: the environment might move to different states probabilistically, depending on what the current state is.
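A tiny illustrative sketch (the states, actions, probabilities and rewards below are made up) of how the pieces of an MDP, P(s' | s, a) and R(s, a, s'), can be written down as Python dictionaries:
# State transition probabilities: P[s][a] maps each next state s' to the
# probability of reaching it when taking action a in state s
P = {
    's1': {
        'a1': {'s2': 0.8, 's3': 0.2},   # a1 usually leads to s2
        'a2': {'s3': 1.0},              # a2 always leads to s3
    },
}
# Reward function: R[(s, a, s')] is the reward for that transition
R = {
    ('s1', 'a1', 's2'): 1.0,
    ('s1', 'a1', 's3'): 0.0,
    ('s1', 'a2', 's3'): 0.5,
}
# Expected immediate reward of taking a1 in s1
expected = sum(p * R[('s1', 'a1', s_next)] for s_next, p in P['s1']['a1'].items())
print(expected)  # 0.8 * 1.0 + 0.2 * 0.0 = 0.8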
Unlike a one-time decision, an optimal policy in reinforcement learning considers the long-term reward. It aims to maximise the total reward accumulated over a sequence of actions, even if some rewards come much later.
Bellman equation(ish) defining the value of a given action in a given state based on future reward.
The value of action a in state s is the reward for s plus the maximum possible future reward from the states reached at each later time step t, increasingly discounted by the discount factor gamma (γ) raised to the power of t.
Gamma (γ) is less than 1.
Assuming γ = 0.5, γ squared would be 0.25, so we only take a quarter of a reward that arrives two steps in the future. We are not confident about what is going to happen in the future, so we only take a little bit of the reward that comes later.
Note that the probabilities of state transitions are not included here.
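A small numerical sketch (my addition) of the discounting idea: a reward arriving t steps in the future is multiplied by γ raised to the power of t, so later rewards count for less and less:
gamma = 0.5
future_rewards = [1.0, 1.0, 1.0, 1.0]  # the same reward arriving at t = 0, 1, 2, 3
discounted = [gamma ** t * r for t, r in enumerate(future_rewards)]
print(discounted)        # [1.0, 0.5, 0.25, 0.125]
print(sum(discounted))   # 1.875 — the total discounted future reward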
What are the future states and rewards?
The state transition matrix describes how the environment reacts to the chosen actions (how the state will change over time based on the chosen actions). It tells us the probability of reaching different states after taking specific actions in the current state.
The action policy, on the other hand, guides the decision-making. It takes the current state as input and recommends which action to take. This recommendation can be based on maximising immediate reward, long-term reward, or other criteria depending on the specific policy.
In reinforcement learning, creating an optimal action policy often requires complete knowledge of the environment. This includes knowing the transition matrix (all possible state transitions based on actions) and the rewards associated with each transition.
However, in most real-world scenarios, this information is incomplete. Q-learning is a technique that addresses this challenge. It focuses on learning a Q-value function, which estimates the expected future reward for taking a specific action in a particular state.
The goal of Q-learning is to find the optimal Q-function (Q*), which tells us the best action to take in any given state to maximise future rewards.
There are various methods for doing Q-learning, but most of them don't work for real problems. The one that works here is to approximate the value function using a deep neural network, called a Deep Q-Network (DQN).
DQN agent architecture
An agent is an entity that can observe and act autonomously.
We need an agent architecture that solves two problems: no state transition matrix and no action policy.
We explore the game and make observations of the form: s, a, s', r, and done.
s = state now
a = action taken
s' = next state
r = reward
done = true/false (is the game finished?)
For DQN, these observations are stored in the replay buffer. Over time, the agent fills up a large replay buffer. An example for one state, s1, and three actions, a1/a2/a3, is as follows:
s1, a1 → s2, r1, d0
s1, a2 → s3, r2, d0
s1, a3 → s4, r3, d1
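A minimal sketch of a replay buffer (an illustration; real DQN implementations vary, but most use something similar): a fixed-size store of (s, a, r, s', done) tuples that can be sampled at random:
import random
from collections import deque
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # Old observations are discarded once the buffer is full
        self.buffer = deque(maxlen=capacity)
    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))
    def sample(self, batch_size):
        # Uniformly sample a mini-batch of past observations
        return random.sample(self.buffer, batch_size)
buffer = ReplayBuffer()
buffer.add('s1', 'a1', 1.0, 's2', False)
buffer.add('s1', 'a2', 2.0, 's3', False)
buffer.add('s1', 'a3', 3.0, 's4', True)
print(buffer.sample(2))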
Epsilon-greedy exploration is the method that the agent uses to fill up the replay buffer. As an acting policy, it is a simple and effective way of balancing exploration and exploitation based on the estimated rewards.
Epsilon-greedy works by introducing a probability (epsilon, ε) of taking a random action instead of the one with the highest estimated Q-value. → this encourages exploration and helps the agent discover potentially better actions it might not have encountered yet
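A minimal sketch of epsilon-greedy action selection (the q_values dictionary here is a placeholder for whatever the agent currently estimates):
import random
def epsilon_greedy(q_values, epsilon=0.1):
    # q_values: dict mapping each action to its current estimated Q-value
    if random.random() < epsilon:
        # Explore: pick a random action
        return random.choice(list(q_values.keys()))
    # Exploit: pick the action with the highest estimated Q-value
    return max(q_values, key=q_values.get)
q_estimates = {'a1': 0.2, 'a2': 0.9, 'a3': 0.5}
print(epsilon_greedy(q_estimates))  # usually 'a2', occasionally a random action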
The DQN agent
Knows about the states and rewards
Acts in the game world by taking actions
Makes observations of what is happening in the game
Replay buffer consists of many observations of the actions taken by the agent in the game world and the results of those actions
In Deep Q-Networks (DQN), a crucial part of the training process is the loss function. This function helps the network learn by measuring the difference between its predictions and the desired outcome.
Theta (Θ) denotes the weights of the network.
Θ⁻ (theta with a bar) denotes an older copy of the network's weights, used as the target network.
To train the network, we use a technique called experience replay. We store past experiences (state, action, reward, next state) in a replay buffer (D). During training, we sample a mini-batch of these experiences uniformly at random, denoted by U(D), to create a training set. This training set feeds the network and helps it learn from a variety of past experiences.
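A simplified numpy sketch (my own illustration, with toy functions standing in for the real networks and an arbitrary γ) of how the loss is computed for a sampled mini-batch: the prediction Q(s, a; Θ) is compared with the target r + γ · max over a' of Q(s', a'; Θ⁻), and the mean squared error between them is the loss:
import numpy as np
gamma = 0.99
def q_network(states):
    # Stand-in for Q(s, ·; Θ): one Q-value per action for each state
    return np.array([[0.5, 1.2, 0.3] for _ in states])
def target_network(states):
    # Stand-in for Q(s, ·; Θ⁻): an older, frozen copy of the network
    return np.array([[0.4, 1.0, 0.2] for _ in states])
# A mini-batch sampled uniformly from the replay buffer D
states      = ['s1', 's1', 's1']
actions     = np.array([0, 1, 2])
rewards     = np.array([1.0, 2.0, 3.0])
next_states = ['s2', 's3', 's4']
dones       = np.array([0.0, 0.0, 1.0])  # 1.0 where the game finished
# Prediction: the Q-value of the action actually taken
prediction = q_network(states)[np.arange(len(actions)), actions]
# Target: reward plus discounted best future value (no future term when done)
target = rewards + gamma * (1.0 - dones) * target_network(next_states).max(axis=1)
loss = np.mean((target - prediction) ** 2)
print(loss)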