In the field of reinforcement learning, the action policy is a mapping between states and actions, denoted by the Greek letter 'π' (pi). This means that the policy, given state s, will recommend taking action a: π(s) → a.

  • State: s
  • Action: a
  • Next state: s'
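
As a minimal sketch of the policy as a mapping (the state and action names below are invented for illustration), a deterministic policy can literally be a lookup table from states to actions:

```python
# A deterministic policy: a lookup table from state to action.
# State and action names are illustrative placeholders.
policy = {
    "s1": "a2",
    "s2": "a1",
    "s3": "a3",
}

def pi(state):
    """π(s) -> a : recommend an action for the given state."""
    return policy[state]

print(pi("s1"))  # -> "a2"
```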

A Markov Decision Process is a way of formalising a stochastic sequential decision problem.

  • State transitions: P(s' | s, a)
  • Reward function: R(s, a, s')

Formalising basically means expressing something in a clear, mathematical way so that it can then be used to build algorithms. Stochastic means there is a probabilistic element: the process might move to different next states probabilistically based on the current state (and the chosen action). Sequential means decisions are made one after another over time.
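
As a rough sketch of what that formalisation can look like in code (all states, actions, probabilities, and rewards below are invented for illustration), the MDP boils down to a transition table P(s' | s, a) and a reward function R(s, a, s'):

```python
# Transition probabilities P(s' | s, a): for each (state, action),
# a distribution over possible next states. Values are illustrative.
P = {
    ("s1", "a1"): {"s2": 0.8, "s3": 0.2},
    ("s1", "a2"): {"s3": 1.0},
    ("s2", "a1"): {"s1": 0.5, "s3": 0.5},
}

# Reward function R(s, a, s'): reward for each transition. Illustrative values.
R = {
    ("s1", "a1", "s2"): 1.0,
    ("s1", "a1", "s3"): 0.0,
    ("s1", "a2", "s3"): 2.0,
    ("s2", "a1", "s1"): 0.5,
    ("s2", "a1", "s3"): 1.5,
}

# The "stochastic" part: taking a1 in s1 lands in s2 or s3 probabilistically.
print(P[("s1", "a1")])  # {'s2': 0.8, 's3': 0.2}
```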

 

Unlike a one-time decision, an optimal policy in reinforcement learning considers the long-term reward. It aims to maximise the total reward accumulated over a sequence of actions, even if some rewards come much later.

 

The Bellman equation (roughly) defines the value of a given action in a given state based on future reward.

The value of action a in state s is the immediate reward plus the maximum possible future reward from the states reached at each later time step t, each increasingly discounted by the discount factor gamma (γ) raised to the power of t.

  • Gamma (γ) is less than 1.
    • Assuming γ = 0.5, then γ squared would be 0.25, so we only take a quarter of a reward that arrives two steps in the future. We're not confident about what's going to happen further ahead, so we only take a little bit of the reward that comes later.

Note that the probabilities of state transitions are not included here.
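
A small worked example of this discounting, using the γ = 0.5 value from the bullet above (the reward sequence itself is made up):

```python
# Discounted return: sum over t of gamma**t * r_t, with gamma < 1.
def discounted_return(rewards, gamma=0.5):
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]        # illustrative future rewards
print(discounted_return(rewards))     # 1 + 0.5 + 0.25 + 0.125 = 1.875
```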

 

What are the future states and rewards?

  • The state transition matrix describes how the environment reacts to the chosen actions (how the state will change over time based on the chosen actions). It tells us the probability of reaching different states after taking specific actions in the current state.
  • The action policy, on the other hand, guides the decision-making. It takes the current state as input and recommends which action to take. This recommendation can be based on maximising immediate reward, long-term reward, or other criteria depending on the specific policy.
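
A minimal sketch of how the two fit together in one step of interaction (all names and probabilities are illustrative): the action policy picks the action, and the state transition matrix determines, probabilistically, which state comes next.

```python
import random

# Action policy: which action to take in each state (illustrative).
policy = {"s1": "a1", "s2": "a1"}

# State transition matrix P(s' | s, a) (illustrative probabilities).
P = {
    ("s1", "a1"): {"s2": 0.8, "s3": 0.2},
    ("s2", "a1"): {"s1": 0.5, "s3": 0.5},
}

def step(state):
    """One interaction step: the policy chooses the action,
    the environment samples the next state from P(s' | s, a)."""
    action = policy[state]
    next_states = P[(state, action)]
    s_next = random.choices(list(next_states),
                            weights=list(next_states.values()))[0]
    return action, s_next

print(step("s1"))  # e.g. ('a1', 's2') most of the time, ('a1', 's3') sometimes
```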

In reinforcement learning, creating an optimal action policy often requires complete knowledge of the environment. This includes knowing the transition matrix (all possible state transitions based on actions) and the rewards associated with each transition.

 

However, in most real-world scenarios, this information is incomplete. Q-learning is a technique that addresses this challenge. It focuses on learning a Q-value function, which estimates the expected future reward for taking a specific action in a particular state.

  • The goal of Q-learning is to find the optimal Q-function, Q*, which tells us the best action to take in any given state to maximise future rewards.
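
For comparison, here is a minimal sketch of the classic tabular form of the Q-learning update (not yet DQN), with invented states, actions, and learning rate:

```python
from collections import defaultdict

# Q-table: Q[(state, action)] -> estimated future reward, initialised to 0.
Q = defaultdict(float)

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Example update from a single observed transition (values are illustrative).
q_update("s1", "a1", r=1.0, s_next="s2", actions=["a1", "a2", "a3"])
print(Q[("s1", "a1")])  # 0.1
```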

There are various methods for doing Q-learning, but most of them don't scale to real problems with large state spaces. The approach used here is to approximate the Q-value function with a deep neural network, called a Deep Q-Network (DQN).
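
A minimal sketch of what "approximating the Q-function with a deep network" means, assuming PyTorch and inventing the state size and number of actions: the network takes a state vector and outputs one Q-value per action.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action (sizes are illustrative)."""
    def __init__(self, state_dim=4, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
state = torch.randn(1, 4)          # a dummy state vector
q_values = q_net(state)            # one Q-value per action
action = q_values.argmax(dim=1)    # greedy action
print(q_values, action)
```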

 

DQN agent architecture

  • An agent is an entity that can observe and act autonomously.
    • We need an agent architecture that solves two problems: there is no state transition matrix and no action policy available.

We explore the game and make observations of the form: s, a, s', r, and done.

  • s = state now
  • a = action taken
  • s' = next state
  • r = reward
  • done = true/false: is the game finished?

For DQN, this is the 'replay buffer'. Over time, the agent fills up a large replay buffer. An example for one state, s1, and three actions, a1/a2/a3, is as follows (d = done; a small code sketch of such a buffer follows the list):

  1. s1, a1 → s2, r1, d0
  2. s1, a2 → s3, r2, d0
  3. s1, a3 → s4, r3, d1
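
A sketch of such a replay buffer (the capacity, batch size, and reward values are invented for illustration):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, s_next, r, done) observations and samples mini-batches."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest observations drop off

    def add(self, s, a, s_next, r, done):
        self.buffer.append((s, a, s_next, r, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

# The three example observations from the list above (rewards are illustrative).
buf = ReplayBuffer()
buf.add("s1", "a1", "s2", 1.0, False)
buf.add("s1", "a2", "s3", 2.0, False)
buf.add("s1", "a3", "s4", 3.0, True)
print(buf.sample(batch_size=2))
```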

Epsilon-greedy exploration is the method that the agent uses to fill up the replay buffer. As an acting policy, it is a simple and effective way of balancing exploration and exploitation based on the estimated rewards.

  • Epsilon-greedy works by introducing a probability (epsilon, ε) of taking a random action instead of the one with the highest estimated Q-value. This encourages exploration and helps the agent discover potentially better actions it might not have encountered yet.
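
A minimal sketch of epsilon-greedy action selection, assuming we already have Q-value estimates for each action (the numbers are invented):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest estimated Q-value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_estimates = [0.2, 1.5, 0.7]        # illustrative Q-values for actions a1/a2/a3
print(epsilon_greedy(q_estimates))   # usually 1 (a2), occasionally random
```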

The DQN agent

  • Knows about the states and rewards
  • Acts in the game world by taking actions
  • Makes observations of what is happening in the game
  • Maintains a replay buffer consisting of many observations of the actions taken by the agent in the game world and the results of those actions

In Deep Q-Networks (DQN), a crucial part of the training process is the loss function. This function helps the network learn by measuring the difference between its predictions and the desired outcome.

  • Theta (θ) denotes the weights of the network.
  • θ⁻ (theta with a bar, i.e. an older, frozen copy of θ) denotes the weights of the old version of the network, used to compute the training targets.

To train the network, we use a technique called experience replay. We store past experiences (state, action, reward, next state) in a replay buffer (D). During training, we uniformly sample a mini-batch of these experiences, denoted by U(D), to create a training set. This training set feeds the network and helps it learn from various past experiences.
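
Putting this together, the loss from the original DQN paper has (roughly) the form L(θ) = E_{(s,a,r,s') ~ U(D)} [ ( r + γ · max_a' Q(s', a'; θ⁻) − Q(s, a; θ) )² ]. Below is a rough sketch of computing it on a sampled mini-batch, assuming PyTorch; the network shapes, batch contents, and hyperparameters are all illustrative stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean squared TD error over a mini-batch sampled uniformly from D:
    ( r + gamma * max_a' Q(s', a'; theta_minus) - Q(s, a; theta) )^2."""
    states, actions, next_states, rewards, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # targets use the old (frozen) network theta_minus
        max_next = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * max_next * (1 - dones)
    return F.mse_loss(q_sa, targets)

# Illustrative usage with random tensors standing in for a sampled mini-batch.
q_net = nn.Linear(4, 3)        # stand-in for the online network Q(.; theta)
target_net = nn.Linear(4, 3)   # stand-in for the old network Q(.; theta_minus)
batch = (
    torch.randn(32, 4),                    # states
    torch.randint(0, 3, (32,)),            # actions
    torch.randn(32, 4),                    # next states
    torch.randn(32),                       # rewards
    torch.zeros(32),                       # done flags
)
loss = dqn_loss(q_net, target_net, batch)
loss.backward()
print(loss.item())
```

The torch.no_grad() block is what makes θ⁻ behave as a frozen target: gradients only flow into the online network θ.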
