3.13. AlphaZero¶

This notebook is based on the paper:

with additional insight from:

This code uses conx, a new layer that sits on top of Keras. Conx is designed to be simpler and more intuitive than Keras, with integrated visualizations.

Currently this code requires the TensorFlow backend, as it includes a function written specifically for TF.

3.13.1. The Game¶

First, let’s look at a specific game. We could use many, but for this demonstration we’ll pick ConnectFour. The aima3 package provides a game engine and a good collection of games, based on the code from Artificial Intelligence: A Modern Approach.

If you would like to install aima3, you can use something like this in a cell:

! pip install aima3 -U --user


aima3 has other games that you can play as well as ConnectFour, including TicTacToe. aima3 has many AI algorithms wrapped up to play games. You can see more details about the game engine and ConnectFour here:

and other resources in that repository.

We import some of these that will be useful in our AlphaZero exploration:

In [1]:

from aima3.games import (ConnectFour, RandomPlayer,
                         MCTSPlayer, QueryPlayer, Player,
                         MiniMaxPlayer, AlphaBetaPlayer,
                         AlphaBetaCutoffPlayer)
import numpy as np


Let’s make a game:

In [2]:

game = ConnectFour()


and play a game between two random players:

In [3]:

game.play_game(RandomPlayer("Random-1"), RandomPlayer("Random-2"))

Random-2 is thinking...
Random-2 makes action (5, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . X . .
Random-1 is thinking...
Random-1 makes action (5, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. . . . X . .
Random-2 is thinking...
Random-2 makes action (6, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. . . . X X .
Random-1 is thinking...
Random-1 makes action (5, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. . . . O . .
. . . . X X .
Random-2 is thinking...
Random-2 makes action (1, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. . . . O . .
X . . . X X .
Random-1 is thinking...
Random-1 makes action (3, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. . . . O . .
X . O . X X .
Random-2 is thinking...
Random-2 makes action (3, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. . X . O . .
X . O . X X .
Random-1 is thinking...
Random-1 makes action (2, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. . X . O . .
X O O . X X .
Random-2 is thinking...
Random-2 makes action (4, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. . X . O . .
X O O X X X .
Random-1 is thinking...
Random-1 makes action (3, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . O . O . .
. . X . O . .
X O O X X X .
Random-2 is thinking...
Random-2 makes action (2, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . O . O . .
. X X . O . .
X O O X X X .
Random-1 is thinking...
Random-1 makes action (7, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . O . O . .
. X X . O . .
X O O X X X O
Random-2 is thinking...
Random-2 makes action (3, 4):
. . . . . . .
. . . . . . .
. . . . . . .
. . X . . . .
. . O . O . .
. X X . O . .
X O O X X X O
Random-1 is thinking...
Random-1 makes action (7, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . X . . . .
. . O . O . .
. X X . O . O
X O O X X X O
Random-2 is thinking...
Random-2 makes action (5, 4):
. . . . . . .
. . . . . . .
. . . . . . .
. . X . X . .
. . O . O . .
. X X . O . O
X O O X X X O
Random-1 is thinking...
Random-1 makes action (2, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . X . X . .
. O O . O . .
. X X . O . O
X O O X X X O
Random-2 is thinking...
Random-2 makes action (2, 4):
. . . . . . .
. . . . . . .
. . . . . . .
. X X . X . .
. O O . O . .
. X X . O . O
X O O X X X O
Random-1 is thinking...
Random-1 makes action (5, 5):
. . . . . . .
. . . . . . .
. . . . O . .
. X X . X . .
. O O . O . .
. X X . O . O
X O O X X X O
Random-2 is thinking...
Random-2 makes action (4, 2):
. . . . . . .
. . . . . . .
. . . . O . .
. X X . X . .
. O O . O . .
. X X X O . O
X O O X X X O
Random-1 is thinking...
Random-1 makes action (7, 3):
. . . . . . .
. . . . . . .
. . . . O . .
. X X . X . .
. O O . O . O
. X X X O . O
X O O X X X O
Random-2 is thinking...
Random-2 makes action (1, 2):
. . . . . . .
. . . . . . .
. . . . O . .
. X X . X . .
. O O . O . O
X X X X O . O
X O O X X X O
***** Random-2 wins!

Out[3]:

['Random-2']


We can also play a match (a series of games), or even a tournament among several players.

p1 = RandomPlayer("Random-1")
p2 = MiniMaxPlayer("MiniMax-1")
p3 = AlphaBetaCutoffPlayer("ABCutoff-1")

game.play_matches(10, p1, p2)

game.play_tournament(1, p1, p2, p3)


Can you beat RandomPlayer? We hope so!

Can you beat MiniMax? No! But it takes far too long to be practical.

Humans enter their commands by (column, row) where column starts at 1 from left, and row starts at 1 from bottom.

In [4]:

# game.play_game(AlphaBetaCutoffPlayer("AlphaBetaCutoff"), HumanPlayer("Your Name Here"))


3.13.2. The Network¶

Next, we are going to build the same kind of network described in the AlphaZero paper.

Make sure to set your Keras backend to TensorFlow for now, as we have a function that is written at that level.
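If your Keras is configured with a different backend, one way to select TensorFlow (an illustrative setup cell, not part of the original notebook) is to set the KERAS_BACKEND environment variable before Keras is first imported:

```python
import os

# Must run before keras is first imported in this process:
os.environ["KERAS_BACKEND"] = "tensorflow"
```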

In [5]:

import conx as cx
from aima3.games import Game
from keras import regularizers

Using TensorFlow backend.
/usr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
return f(*args, **kwds)
conx, version 3.5.13

In [6]:

## NEED TO REWRITE THIS FUNCTION IN KERAS:

import tensorflow as tf

def softmax_cross_entropy_with_logits(y_true, y_pred):
    p = y_pred
    pi = y_true
    zero = tf.zeros(shape=tf.shape(pi), dtype=tf.float32)
    where = tf.equal(pi, zero)
    negatives = tf.fill(tf.shape(pi), -100.0)
    p = tf.where(where, negatives, p)
    loss = tf.nn.softmax_cross_entropy_with_logits(labels=pi, logits=p)
    return loss
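To see what the masking accomplishes, here is a pure-numpy sketch (an illustration, not the training code): positions where the target distribution pi is zero correspond to illegal moves, and forcing their logits to -100 makes their softmax probability effectively zero:

```python
import numpy as np

def masked_softmax_cross_entropy(pi, logits):
    # Replace logits of illegal moves (pi == 0) with -100:
    p = np.where(pi == 0, -100.0, logits)
    # Softmax over the masked logits:
    exps = np.exp(p - p.max())
    probs = exps / exps.sum()
    # Cross-entropy against the target distribution pi:
    return probs, -np.sum(pi * np.log(probs + 1e-12))

pi     = np.array([0.0, 0.7, 0.3, 0.0])   # zeros mark illegal moves
logits = np.array([5.0, 1.0, 2.0, 5.0])   # raw network outputs
probs, loss = masked_softmax_cross_entropy(pi, logits)
```

Even though the illegal positions had the largest raw logits, they receive essentially zero probability after masking.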


3.13.2.1. Representations¶

The board state is the most important piece of information. How should we represent it? Possible ideas:

• a vector of 42 values
• a 6x7 matrix

We decided to represent the state of the board as two 6x7 matrices: one for the current player’s pieces, and one for the opponent’s pieces.

We also need to represent actions. Possible ideas:

• 7 outputs, each representing a column to drop a piece into
• two outputs, one representing row, and the other column
• a 6x7 matrix, one cell for each position on the grid
• 42 outputs, one for each position on the grid

We decided to represent them as the final option: 42 outputs.
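As a quick illustration of the two-plane board idea (a toy sketch, not the notebook’s code), a board of 1/-1/0 values splits into one binary plane per player:

```python
import numpy as np

# Toy 2x3 board from the current player's perspective:
# 1 = my piece, -1 = opponent's piece, 0 = empty.
board = np.array([[ 0,  1, -1],
                  [ 1, -1,  0]])

mine   = (board ==  1).astype(int)           # plane 0: current player's pieces
theirs = (board == -1).astype(int)           # plane 1: opponent's pieces
planes = np.stack([mine, theirs], axis=-1)   # shape (2, 3, 2)
```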

The network architecture in AlphaZero is quite large, and has repeating blocks of layers. To help in constructing the network, we define some helper functions:

In [7]:

def add_conv_block(net, input_layer):
    cname = net.add(cx.Conv2DLayer("conv2d-%d",
                                   filters=75,
                                   kernel_size=(4,4),
                                   padding='same',
                                   use_bias=False,
                                   activation='linear',
                                   kernel_regularizer=regularizers.l2(0.0001)))
    bname = net.add(cx.BatchNormalizationLayer("batch-norm-%d", axis=1))
    lname = net.add(cx.LeakyReLULayer("leaky-relu-%d"))
    net.connect(input_layer, cname)
    net.connect(cname, bname)
    net.connect(bname, lname)
    return lname

def add_residual_block(net, input_layer):
    prev_layer = add_conv_block(net, input_layer)
    cname = net.add(cx.Conv2DLayer("conv2d-%d",
                                   filters=75,
                                   kernel_size=(4,4),
                                   padding='same',
                                   use_bias=False,
                                   activation='linear',
                                   kernel_regularizer=regularizers.l2(0.0001)))
    bname = net.add(cx.BatchNormalizationLayer("batch-norm-%d", axis=1))
    aname = net.add(cx.AddLayer("add-%d"))
    lname = net.add(cx.LeakyReLULayer("leaky-relu-%d"))
    net.connect(prev_layer, cname)
    net.connect(cname, bname)
    net.connect(input_layer, aname)
    net.connect(bname, aname)
    net.connect(aname, lname)
    return lname

def add_value_head(net, input_layer):
    l1 = net.add(cx.Conv2DLayer("conv2d-%d",
                                filters=1,
                                kernel_size=(1,1),
                                use_bias=False,
                                activation='linear',
                                kernel_regularizer=regularizers.l2(0.0001)))
    l2 = net.add(cx.BatchNormalizationLayer("batch-norm-%d", axis=1))
    l3 = net.add(cx.LeakyReLULayer("leaky-relu-%d"))
    l4 = net.add(cx.FlattenLayer("flatten-%d"))
    l5 = net.add(cx.Layer("dense-%d",
                          20,
                          use_bias=False,
                          activation='linear',
                          kernel_regularizer=regularizers.l2(0.0001)))
    l6 = net.add(cx.LeakyReLULayer("leaky-relu-%d"))
    l7 = net.add(cx.Layer("value_head",
                          1,
                          use_bias=False,
                          activation='tanh',
                          kernel_regularizer=regularizers.l2(0.0001)))
    net.connect(input_layer, l1)
    net.connect(l1, l2)
    net.connect(l2, l3)
    net.connect(l3, l4)
    net.connect(l4, l5)
    net.connect(l5, l6)
    net.connect(l6, l7)
    return l7

def add_policy_head(net, input_layer):
    l1 = net.add(cx.Conv2DLayer("conv2d-%d",
                                filters=2,
                                kernel_size=(1,1),
                                use_bias=False,
                                activation='linear',
                                kernel_regularizer=regularizers.l2(0.0001)))
    l2 = net.add(cx.BatchNormalizationLayer("batch-norm-%d", axis=1))
    l3 = net.add(cx.LeakyReLULayer("leaky-relu-%d"))
    l4 = net.add(cx.FlattenLayer("flatten-%d"))
    l5 = net.add(cx.Layer("policy_head",
                          42,
                          use_bias=False,
                          activation='linear',
                          kernel_regularizer=regularizers.l2(0.0001)))
    net.connect(input_layer, l1)
    net.connect(l1, l2)
    net.connect(l2, l3)
    net.connect(l3, l4)
    net.connect(l4, l5)
    return l5

In [8]:

def make_network(game, residuals=5):
    net = cx.Network("AlphaZero Network")
    net.add(cx.Layer("main_input", (game.v, game.h, 2)))
    out_layer = add_conv_block(net, "main_input")
    for i in range(residuals):
        out_layer = add_residual_block(net, out_layer)
    add_policy_head(net, out_layer)
    add_value_head(net, out_layer)
    net.compile(error={"policy_head": softmax_cross_entropy_with_logits,
                       "value_head": "mean_squared_error"},
                optimizer=cx.SGD(lr=0.1, momentum=0.9))
    for layer in net.layers:
        if layer.kind() == "hidden":
            layer.visible = False
    return net

In [9]:

game = ConnectFour()
net = make_network(game)

In [10]:

net.model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
main_input (InputLayer)         (None, 6, 7, 2)      0
__________________________________________________________________________________________________
conv2d-1 (Conv2D)               (None, 6, 7, 75)     2400        main_input[0][0]
__________________________________________________________________________________________________
batch-norm-1 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-1[0][0]
__________________________________________________________________________________________________
leaky-relu-1 (LeakyReLU)        (None, 6, 7, 75)     0           batch-norm-1[0][0]
__________________________________________________________________________________________________
conv2d-2 (Conv2D)               (None, 6, 7, 75)     90000       leaky-relu-1[0][0]
__________________________________________________________________________________________________
batch-norm-2 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-2[0][0]
__________________________________________________________________________________________________
leaky-relu-2 (LeakyReLU)        (None, 6, 7, 75)     0           batch-norm-2[0][0]
__________________________________________________________________________________________________
conv2d-3 (Conv2D)               (None, 6, 7, 75)     90000       leaky-relu-2[0][0]
__________________________________________________________________________________________________
batch-norm-3 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-3[0][0]
__________________________________________________________________________________________________
add-1 (Add)                     (None, 6, 7, 75)     0           leaky-relu-1[0][0]
batch-norm-3[0][0]
__________________________________________________________________________________________________
leaky-relu-3 (LeakyReLU)        (None, 6, 7, 75)     0           add-1[0][0]
__________________________________________________________________________________________________
conv2d-4 (Conv2D)               (None, 6, 7, 75)     90000       leaky-relu-3[0][0]
__________________________________________________________________________________________________
batch-norm-4 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-4[0][0]
__________________________________________________________________________________________________
leaky-relu-4 (LeakyReLU)        (None, 6, 7, 75)     0           batch-norm-4[0][0]
__________________________________________________________________________________________________
conv2d-5 (Conv2D)               (None, 6, 7, 75)     90000       leaky-relu-4[0][0]
__________________________________________________________________________________________________
batch-norm-5 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-5[0][0]
__________________________________________________________________________________________________
add-2 (Add)                     (None, 6, 7, 75)     0           leaky-relu-3[0][0]
batch-norm-5[0][0]
__________________________________________________________________________________________________
leaky-relu-5 (LeakyReLU)        (None, 6, 7, 75)     0           add-2[0][0]
__________________________________________________________________________________________________
conv2d-6 (Conv2D)               (None, 6, 7, 75)     90000       leaky-relu-5[0][0]
__________________________________________________________________________________________________
batch-norm-6 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-6[0][0]
__________________________________________________________________________________________________
leaky-relu-6 (LeakyReLU)        (None, 6, 7, 75)     0           batch-norm-6[0][0]
__________________________________________________________________________________________________
conv2d-7 (Conv2D)               (None, 6, 7, 75)     90000       leaky-relu-6[0][0]
__________________________________________________________________________________________________
batch-norm-7 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-7[0][0]
__________________________________________________________________________________________________
add-3 (Add)                     (None, 6, 7, 75)     0           leaky-relu-5[0][0]
batch-norm-7[0][0]
__________________________________________________________________________________________________
leaky-relu-7 (LeakyReLU)        (None, 6, 7, 75)     0           add-3[0][0]
__________________________________________________________________________________________________
conv2d-8 (Conv2D)               (None, 6, 7, 75)     90000       leaky-relu-7[0][0]
__________________________________________________________________________________________________
batch-norm-8 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-8[0][0]
__________________________________________________________________________________________________
leaky-relu-8 (LeakyReLU)        (None, 6, 7, 75)     0           batch-norm-8[0][0]
__________________________________________________________________________________________________
conv2d-9 (Conv2D)               (None, 6, 7, 75)     90000       leaky-relu-8[0][0]
__________________________________________________________________________________________________
batch-norm-9 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-9[0][0]
__________________________________________________________________________________________________
add-4 (Add)                     (None, 6, 7, 75)     0           leaky-relu-7[0][0]
batch-norm-9[0][0]
__________________________________________________________________________________________________
leaky-relu-9 (LeakyReLU)        (None, 6, 7, 75)     0           add-4[0][0]
__________________________________________________________________________________________________
conv2d-10 (Conv2D)              (None, 6, 7, 75)     90000       leaky-relu-9[0][0]
__________________________________________________________________________________________________
batch-norm-10 (BatchNormalizati (None, 6, 7, 75)     24          conv2d-10[0][0]
__________________________________________________________________________________________________
leaky-relu-10 (LeakyReLU)       (None, 6, 7, 75)     0           batch-norm-10[0][0]
__________________________________________________________________________________________________
conv2d-11 (Conv2D)              (None, 6, 7, 75)     90000       leaky-relu-10[0][0]
__________________________________________________________________________________________________
batch-norm-11 (BatchNormalizati (None, 6, 7, 75)     24          conv2d-11[0][0]
__________________________________________________________________________________________________
add-5 (Add)                     (None, 6, 7, 75)     0           leaky-relu-9[0][0]
batch-norm-11[0][0]
__________________________________________________________________________________________________
leaky-relu-11 (LeakyReLU)       (None, 6, 7, 75)     0           add-5[0][0]
__________________________________________________________________________________________________
conv2d-13 (Conv2D)              (None, 6, 7, 1)      75          leaky-relu-11[0][0]
__________________________________________________________________________________________________
batch-norm-13 (BatchNormalizati (None, 6, 7, 1)      24          conv2d-13[0][0]
__________________________________________________________________________________________________
conv2d-12 (Conv2D)              (None, 6, 7, 2)      150         leaky-relu-11[0][0]
__________________________________________________________________________________________________
leaky-relu-13 (LeakyReLU)       (None, 6, 7, 1)      0           batch-norm-13[0][0]
__________________________________________________________________________________________________
batch-norm-12 (BatchNormalizati (None, 6, 7, 2)      24          conv2d-12[0][0]
__________________________________________________________________________________________________
flatten-2 (Flatten)             (None, 42)           0           leaky-relu-13[0][0]
__________________________________________________________________________________________________
leaky-relu-12 (LeakyReLU)       (None, 6, 7, 2)      0           batch-norm-12[0][0]
__________________________________________________________________________________________________
dense-1 (Dense)                 (None, 20)           840         flatten-2[0][0]
__________________________________________________________________________________________________
flatten-1 (Flatten)             (None, 84)           0           leaky-relu-12[0][0]
__________________________________________________________________________________________________
leaky-relu-14 (LeakyReLU)       (None, 20)           0           dense-1[0][0]
__________________________________________________________________________________________________
policy_head (Dense)             (None, 42)           3528        flatten-1[0][0]
__________________________________________________________________________________________________
value_head (Dense)              (None, 1)            20          leaky-relu-14[0][0]
==================================================================================================
Total params: 907,325
Trainable params: 907,169
Non-trainable params: 156
__________________________________________________________________________________________________

In [11]:

len(net.layers)

Out[11]:

51

In [12]:

net.render()

Out[12]:


3.13.3. Connecting the Network to the Game¶

First, we need a mapping from game (x,y) moves to a position in a list of actions and probabilities.

In [13]:

def make_mappings(game):
    """
    Get a mapping from game's (x,y) to array position.
    """
    move2pos = {}
    pos2move = []
    position = 0
    for y in range(game.v, 0, -1):
        for x in range(1, game.h + 1):
            move2pos[(x,y)] = position
            pos2move.append((x,y))
            position += 1
    return move2pos, pos2move


We use the ConnectFour game, defined above:

In [14]:

move2pos, pos2move = make_mappings(game)

In [15]:

move2pos[(2,1)]

Out[15]:

36

In [16]:

pos2move[35]

Out[16]:

(1, 1)


We need a method to convert a game state into an array:

In [17]:

def state2array(game, state):
    array = []
    to_move = game.to_move(state)
    for y in range(game.v, 0, -1):
        for x in range(1, game.h + 1):
            item = state.board.get((x, y), 0)
            if item != 0:
                item = 1 if item == to_move else -1
            array.append(item)
    return array

In [18]:

cx.shape(state2array(game, game.initial))

Out[18]:

(42,)


So, state2array returns a list of 42 numbers, where:

• 0 represents an empty place
• 1 represents one of my pieces
• -1 represents one of my opponent’s pieces

Note that “my” and “my opponent” swap back and forth depending on perspective (i.e., whose turn it is, as determined by game.to_move(state)).
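A toy sketch of this perspective flip (a hypothetical helper, not part of the notebook): the same two pieces encode with opposite signs depending on whose turn it is:

```python
def encode(pieces, to_move):
    # pieces maps (x, y) -> 'X' or 'O'; encode 1 for the player to
    # move, -1 for the opponent (mirroring state2array's convention).
    return {pos: (1 if p == to_move else -1) for pos, p in pieces.items()}

pieces = {(1, 1): 'X', (2, 1): 'O'}
as_x = encode(pieces, 'X')   # X to move: X's piece is "mine"
as_o = encode(pieces, 'O')   # O to move: the signs flip
```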

In [19]:

def state2inputs(game, state):
    board = np.array(state2array(game, state)) # 1 is my pieces, -1 other
    currentplayer_position = np.zeros(len(board), dtype=np.int)
    currentplayer_position[board==1] = 1
    other_position = np.zeros(len(board), dtype=np.int)
    other_position[board==-1] = 1
    position = np.array(list(zip(currentplayer_position, other_position)))
    inputs = position.reshape((game.v, game.h, 2))
    return inputs.tolist()


We need to convert the state’s board into a form for the neural network:

In [20]:

state2inputs(game, game.initial)

Out[20]:

[[[0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0]],
[[0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0]],
[[0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0]],
[[0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0]],
[[0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0]],
[[0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0]]]


We can check to see if this is correct by propagating the activations to the first layer.

Initial board state has no pieces on the board:

In [21]:

state = game.initial
net.propagate_to_features("main_input", state2inputs(game, state))

Out[21]:

 Feature 0 Feature 1

Now we make a move to (1,1). But note that after the move, it is now the other player’s move. So the first move is seen on the opponent’s board (the right side, feature #1):

In [22]:

state = game.result(game.initial, (1,1))
net.propagate_to_features("main_input", state2inputs(game, state))

Out[22]:

 Feature 0 Feature 1

Now, the second player moves to (3,1). It is the first player’s turn again, so we are back to the original perspective: the piece that appeared on the right-hand board (feature #1) now shows up on the left (feature #0), because that is the current player’s perspective.

In [23]:

state = game.result(state, (3,1))
net.propagate_to_features("main_input", state2inputs(game, state))

Out[23]:

 Feature 0 Feature 1

Finally, we are ready to connect the game to the network. We define a function get_predictions that takes a game and state, propagates the state through the network, and returns a (value, probabilities, allowedActions) tuple. The probabilities are the pi list from the AlphaZero paper.

In [24]:

def get_predictions(net, game, state):
    """
    Given a state, give output of network on preferred
    actions. state.allowedActions removes impossible
    actions.

    Returns (value, probabilities, allowedActions)
    """
    board = np.array(state2array(game, state)) # 1 is my pieces, -1 other
    inputs = state2inputs(game, state)
    preds = net.propagate(inputs, visualize=True)
    value = preds[1][0]
    logits = np.array(preds[0])
    allowedActions = np.array([move2pos[act] for act in game.actions(state)])
    mask = np.ones(len(board), dtype=bool)
    # SOFTMAX
    odds = np.exp(logits)
    probs = odds / np.sum(odds)
    return (value, probs.tolist(), allowedActions.tolist())

In [25]:

value, probs, acts = get_predictions(net, game, state)

In [26]:

net.snapshot(state2inputs(game, state))

Out[26]:


3.13.4. Testing Game and Network Integration¶

Finally, we turn the predictions into a move, and we can play a game with the network.

In [27]:

class NNPlayer(Player):

    def set_game(self, game):
        """
        Get a mapping from game's (x,y) to array position.
        """
        self.net = make_network(game)
        self.game = game
        self.move2pos = {}
        self.pos2move = []
        position = 0
        for y in range(self.game.v, 0, -1):
            for x in range(1, self.game.h + 1):
                self.move2pos[(x,y)] = position
                self.pos2move.append((x,y))
                position += 1

    def get_predictions(self, state):
        """
        Given a state, give output of network on preferred
        actions. state.allowedActions removes impossible
        actions.

        Returns (value, probabilities, allowedActions)
        """
        board = np.array(self.state2array(state)) # 1 is my pieces, -1 other
        inputs = self.state2inputs(state)
        preds = self.net.propagate(inputs)
        value = preds[1][0]
        logits = np.array(preds[0])
        allowedActions = np.array([self.move2pos[act] for act in self.game.actions(state)])
        mask = np.ones(len(board), dtype=bool)
        # SOFTMAX
        odds = np.exp(logits)
        probs = odds / np.sum(odds)
        return (value, probs.tolist(), allowedActions.tolist())

    def get_action(self, state, turn):
        value, probabilities, moves = self.get_predictions(state)
        probs = np.array(probabilities)[moves]
        pos = cx.choice(moves, probs)
        return self.pos2move[pos]

    def state2inputs(self, state):
        board = np.array(self.state2array(state)) # 1 is my pieces, -1 other
        currentplayer_position = np.zeros(len(board), dtype=np.int)
        currentplayer_position[board==1] = 1
        other_position = np.zeros(len(board), dtype=np.int)
        other_position[board==-1] = 1
        position = np.array(list(zip(currentplayer_position, other_position)))
        inputs = position.reshape((self.game.v, self.game.h, 2))
        return inputs

    def state2array(self, state):
        array = []
        to_move = self.game.to_move(state)
        for y in range(self.game.v, 0, -1):
            for x in range(1, self.game.h + 1):
                item = state.board.get((x, y), 0)
                if item != 0:
                    item = 1 if item == to_move else -1
                array.append(item)
        return array
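The get_action method above restricts the 42 policy outputs to the allowed positions and samples among them. Here is a numpy sketch of that selection step (using np.random.choice in place of cx.choice, with a made-up uniform policy):

```python
import numpy as np

probs_all = np.full(42, 1 / 42)           # a uniform policy, for illustration
allowed   = [35, 36, 37, 38, 39, 40, 41]  # e.g., the bottom row of an empty board

p = probs_all[allowed]
p = p / p.sum()                           # renormalize over allowed positions
pos = np.random.choice(allowed, p=p)      # sample an allowed position
```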

In [28]:

p1 = RandomPlayer("Random")
p2 = NNPlayer("NNPlayer")

In [29]:

p2.set_game(game)

In [30]:

p2.get_action(state, 2)

Out[30]:

(7, 1)

In [31]:

game.play_game(p1, p2)

Random is thinking...
Random makes action (6, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . X .
NNPlayer is thinking...
NNPlayer makes action (5, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O X .
Random is thinking...
Random makes action (5, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . X . .
. . . . O X .
NNPlayer is thinking...
NNPlayer makes action (1, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . X . .
O . . . O X .
Random is thinking...
Random makes action (6, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . X X .
O . . . O X .
NNPlayer is thinking...
NNPlayer makes action (3, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . X X .
O . O . O X .
Random is thinking...
Random makes action (5, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . X . .
. . . . X X .
O . O . O X .
NNPlayer is thinking...
NNPlayer makes action (7, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . X . .
. . . . X X .
O . O . O X O
Random is thinking...
Random makes action (2, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . X . .
. . . . X X .
O X O . O X O
NNPlayer is thinking...
NNPlayer makes action (1, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . X . .
O . . . X X .
O X O . O X O
Random is thinking...
Random makes action (2, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . X . .
O X . . X X .
O X O . O X O
NNPlayer is thinking...
NNPlayer makes action (1, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
O . . . X . .
O X . . X X .
O X O . O X O
Random is thinking...
Random makes action (1, 4):
. . . . . . .
. . . . . . .
. . . . . . .
X . . . . . .
O . . . X . .
O X . . X X .
O X O . O X O
NNPlayer is thinking...
NNPlayer makes action (1, 5):
. . . . . . .
. . . . . . .
O . . . . . .
X . . . . . .
O . . . X . .
O X . . X X .
O X O . O X O
Random is thinking...
Random makes action (1, 6):
. . . . . . .
X . . . . . .
O . . . . . .
X . . . . . .
O . . . X . .
O X . . X X .
O X O . O X O
NNPlayer is thinking...
NNPlayer makes action (3, 2):
. . . . . . .
X . . . . . .
O . . . . . .
X . . . . . .
O . . . X . .
O X O . X X .
O X O . O X O
Random is thinking...
Random makes action (3, 3):
. . . . . . .
X . . . . . .
O . . . . . .
X . . . . . .
O . X . X . .
O X O . X X .
O X O . O X O
NNPlayer is thinking...
NNPlayer makes action (3, 4):
. . . . . . .
X . . . . . .
O . . . . . .
X . O . . . .
O . X . X . .
O X O . X X .
O X O . O X O
Random is thinking...
Random makes action (2, 3):
. . . . . . .
X . . . . . .
O . . . . . .
X . O . . . .
O X X . X . .
O X O . X X .
O X O . O X O
NNPlayer is thinking...
NNPlayer makes action (3, 5):
. . . . . . .
X . . . . . .
O . O . . . .
X . O . . . .
O X X . X . .
O X O . X X .
O X O . O X O
Random is thinking...
Random makes action (3, 6):
. . . . . . .
X . X . . . .
O . O . . . .
X . O . . . .
O X X . X . .
O X O . X X .
O X O . O X O
NNPlayer is thinking...
NNPlayer makes action (2, 4):
. . . . . . .
X . X . . . .
O . O . . . .
X O O . . . .
O X X . X . .
O X O . X X .
O X O . O X O
Random is thinking...
Random makes action (5, 4):
. . . . . . .
X . X . . . .
O . O . . . .
X O O . X . .
O X X . X . .
O X O . X X .
O X O . O X O
NNPlayer is thinking...
NNPlayer makes action (7, 2):
. . . . . . .
X . X . . . .
O . O . . . .
X O O . X . .
O X X . X . .
O X O . X X O
O X O . O X O
Random is thinking...
Random makes action (6, 3):
. . . . . . .
X . X . . . .
O . O . . . .
X O O . X . .
O X X . X X .
O X O . X X O
O X O . O X O
NNPlayer is thinking...
NNPlayer makes action (4, 1):
. . . . . . .
X . X . . . .
O . O . . . .
X O O . X . .
O X X . X X .
O X O . X X O
O X O O O X O
Random is thinking...
Random makes action (6, 4):
. . . . . . .
X . X . . . .
O . O . . . .
X O O . X X .
O X X . X X .
O X O . X X O
O X O O O X O
***** Random wins!

Out[31]:

['Random']


3.13.5. Training The Network¶

Now we are ready to train the network. The training is a clever use of Monte Carlo Tree Search, combined with playing against itself.

There is a Monte Carlo Tree Search player in aima3 that we will use. We set the policy to come from predictions from the neural network.

In [32]:

class AlphaZeroMCTSPlayer(MCTSPlayer):
    """
    A Monte Carlo Tree Search with policy function from
    neural network. Network will be set later to self.nnplayer.
    """
    def policy(self, game, state):
        # these moves are positions:
        value, probs_all, moves = self.nnplayer.get_predictions(state)
        if len(moves) == 0:
            result = [], value
        else:
            probs = np.array(probs_all)[moves]
            moves = [self.nnplayer.pos2move[pos] for pos in moves]
            # we need to return probs and moves for game
            result = [(act, prob) for (act, prob) in list(zip(moves, probs))], value
        return result


The main AlphaZeroPlayer needs to be able to play in one of two modes:

• self_play: it plays against itself (using two different MCTS trees, as this version requires). The network provides policy evaluation for each state as it looks ahead.
• regular play: moves come directly from the network
In [33]:

class AlphaZeroPlayer(NNPlayer):
    ## Load weights if continuing
    def __init__(self, name, n_playout=40, *args, **kwargs):
        super().__init__(name, *args, **kwargs)
        self.mcts_players = [AlphaZeroMCTSPlayer("MCTS-1", n_playout=n_playout),
                             AlphaZeroMCTSPlayer("MCTS-2", n_playout=n_playout)]

    def set_game(self, game):
        super().set_game(game)
        self.mcts_players[0].set_game(game)
        self.mcts_players[1].set_game(game)
        self.mcts_players[0].nnplayer = self
        self.mcts_players[1].nnplayer = self
        self.data = [[], []]
        self.cache = {}

    def get_action(self, state, turn, self_play):
        if self_play:
            ## Only way to determine which is which?
            if turn in self.cache:
                player_num = 1
            else:
                player_num = 0
                self.cache[turn] = True
            ## now use the policy to get some probs:
            move, pi = self.mcts_players[player_num].get_action(state, round(turn), return_prob=True)
            ## save the state and probs:
            self.data[player_num].append((self.state2inputs(state), self.move_probs2all_probs(pi)))
            return move
        else:
            # play the network, we're in the playoffs!
            return super().get_action(state, round(turn))

    def move_probs2all_probs(self, move_probs):
        all_probs = np.zeros(len(self.state2array(self.game.initial)))
        for move in move_probs:
            all_probs[self.move2pos[move]] = move_probs[move]
        return all_probs.tolist()


We now set up the game to play in one of the two modes.

One complication when playing itself: the player cannot tell which side it is currently playing, and we want to keep the two sides’ data separate! To keep track, we cache the turn number; if we see the same turn again, we know it is the second player’s move.
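The trick can be sketched in isolation (a toy illustration, not the class itself):

```python
cache = {}

def which_player(turn):
    # First time a turn number is seen: player 0; second time: player 1.
    if turn in cache:
        return 1
    cache[turn] = True
    return 0

seen = [which_player(t) for t in [1, 1, 2, 2, 3, 3]]
```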

In [50]:

class AlphaZeroGame(ConnectFour):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.memory = []

    def play_game(self, *players, flip_coin=False, verbose=1, **kwargs):
        print(kwargs)
        results = super().play_game(*players, flip_coin=flip_coin, verbose=verbose, **kwargs)
        if "self_play" in kwargs and kwargs["self_play"]:
            ## Do not allow flipping coins when self play:
            ## Assumes that player1 == player2 when self-playing
            assert flip_coin is False, "no coin_flip when self-playing"
            ## value is in terms of player 0
            value = self.final_utility
            for state, probs in players[0].data[0]:
                self.memory.append([state, [probs, [value]]])
            # also data from opponent, so flip value:
            value = -value
            for state, probs in players[1].data[1]:
                self.memory.append([state, [probs, [value]]])
        return results
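The sign-flip for the opponent's records can also be checked in miniature. A minimal sketch with a hypothetical store_game helper, assuming each memory record has the `[state, [probs, [value]]]` shape used above and that final_utility is from player 0's perspective:

```python
def store_game(memory, p0_data, p1_data, final_utility):
    """Append both players' (state, probs) pairs to memory, storing the
    game's outcome from each player's own point of view."""
    value = final_utility          # utility is in terms of player 0
    for state, probs in p0_data:
        memory.append([state, [probs, [value]]])
    value = -value                 # flip the sign for the opponent
    for state, probs in p1_data:
        memory.append([state, [probs, [value]]])

memory = []
store_game(memory,
           p0_data=[("s0", [0.5, 0.5]), ("s2", [1.0, 0.0])],
           p1_data=[("s1", [0.0, 1.0])],
           final_utility=1)        # player 0 won this game
```

Without the flip, the losing side's positions would be labeled as wins, and the value head would learn nothing useful.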

In [51]:

game = AlphaZeroGame()
best_player = AlphaZeroPlayer("best_player")
current_player = AlphaZeroPlayer("current_player")


Some basic tests to make sure things are going in the right place:

In [52]:

current_player.set_game(game)
assert current_player.data == [[], []]
print(current_player.get_action(game.initial, 1, self_play=False))
assert current_player.data == [[], []]
print(current_player.get_action(game.initial, 1, self_play=True))
assert current_player.data[0] != []
print(current_player.get_action(game.initial, 1, self_play=True))
assert current_player.data[1] != []

(6, 1)
(1, 1)
(7, 1)


Sample just for testing:

In [53]:

game.play_tournament(1, best_player, best_player, verbose=1, mode="ordered", self_play=True)

Tournament to begin with 2 matches...
{'self_play': True}
best_player is thinking...
best_player makes action (5, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . X . .
best_player is thinking...
best_player makes action (5, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. . . . X . .
best_player is thinking...
best_player makes action (6, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. . . . X X .
best_player is thinking...
best_player makes action (7, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. . . . X X O
best_player is thinking...
best_player makes action (3, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. . X . X X O
best_player is thinking...
best_player makes action (5, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. . . . O . .
. . X . X X O
best_player is thinking...
best_player makes action (4, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. . . . O . .
. . X X X X O
***** best_player wins!
{'self_play': True}
best_player is thinking...
best_player makes action (4, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . X . . .
best_player is thinking...
best_player makes action (7, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . X . . O
best_player is thinking...
best_player makes action (1, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
X . . X . . O
best_player is thinking...
best_player makes action (4, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . O . . .
X . . X . . O
best_player is thinking...
best_player makes action (5, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . O . . .
X . . X X . O
best_player is thinking...
best_player makes action (4, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . O . . .
. . . O . . .
X . . X X . O
best_player is thinking...
best_player makes action (7, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . O . . .
. . . O . . X
X . . X X . O
best_player is thinking...
best_player makes action (1, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . O . . .
O . . O . . X
X . . X X . O
best_player is thinking...
best_player makes action (2, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . O . . .
O . . O . . X
X X . X X . O
best_player is thinking...
best_player makes action (6, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . O . . .
O . . O . . X
X X . X X O O
best_player is thinking...
best_player makes action (2, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . O . . .
O X . O . . X
X X . X X O O
best_player is thinking...
best_player makes action (3, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . O . . .
O X . O . . X
X X O X X O O
best_player is thinking...
best_player makes action (4, 4):
. . . . . . .
. . . . . . .
. . . . . . .
. . . X . . .
. . . O . . .
O X . O . . X
X X O X X O O
best_player is thinking...
best_player makes action (3, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . X . . .
. . . O . . .
O X O O . . X
X X O X X O O
best_player is thinking...
best_player makes action (1, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . X . . .
X . . O . . .
O X O O . . X
X X O X X O O
best_player is thinking...
best_player makes action (3, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . X . . .
X . O O . . .
O X O O . . X
X X O X X O O
best_player is thinking...
best_player makes action (6, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . X . . .
X . O O . . .
O X O O . X X
X X O X X O O
best_player is thinking...
best_player makes action (3, 4):
. . . . . . .
. . . . . . .
. . . . . . .
. . O X . . .
X . O O . . .
O X O O . X X
X X O X X O O
***** best_player wins!

Out[53]:

{'DRAW': 0, 'best_player': 2}


Did we collect some history?

In [55]:

len(game.memory)

Out[55]:

25


Ok, we are ready to learn!

In [39]:

epoch = 1
while True:
    print("Epoch #%s..." % epoch)
    # self-play, collect data:
    print("Self-play matches begin...")
    results = game.play_tournament(1, best_player, best_player,
                                   mode="ordered", self_play=True)
    print("Memory size is %s" % len(game.memory))
    if len(game.memory) > 10:
        print("Enough to train!")
        current_player.net.dataset.clear()
        # reload the dataset from the collected self-play memory:
        current_player.net.dataset.load(game.memory)
        print("Training on ", len(current_player.net.dataset.inputs), "patterns...")
        current_player.net.train()
        ## save dataset every once in a while
        ## now see which net is better:
        print("Playing best vs current to see who wins the title...")
        results = game.play_tournament(2, best_player, current_player,
                                       mode="one-each", self_play=False)
        if results["best_player"] < results["current_player"]:
            print("current won! swapping weights")
            # give the better weights to the best_player
            best_player.net.set_weights(
                current_player.net.get_weights())
            ## clear memory here?
        else:
            print("best won!")
    epoch += 1

Epoch #1...
Self-play matches begin...
Memory size is 78
Enough to train!
Training on  78 patterns...
Evaluating initial training metrics...
Training...
|  Training |    policy |     value
Epochs |     Error |  head acc |  head acc
------ | --------- | --------- | ---------
#    0 |   0.97740 |   0.00000 |   0.00000
#    1 |   1.09872 |   0.00000 |   0.06410
========================================================================
#    1 |   1.09872 |   0.00000 |   0.06410
Playing best vs current to see who wins the title...
best won!
Epoch #2...
Self-play matches begin...
Memory size is 112
Enough to train!
Training on  112 patterns...
Evaluating initial training metrics...
Training...
|  Training |    policy |     value
Epochs |     Error |  head acc |  head acc
------ | --------- | --------- | ---------
#    0 |   1.02493 |   0.00000 |   0.00000
#    1 |   0.95806 |   0.00000 |   0.13393
========================================================================
#    1 |   0.95806 |   0.00000 |   0.13393
Playing best vs current to see who wins the title...
best won!
Epoch #3...
Self-play matches begin...
Memory size is 177
Enough to train!
Training on  177 patterns...
Evaluating initial training metrics...
Training...
|  Training |    policy |     value
Epochs |     Error |  head acc |  head acc
------ | --------- | --------- | ---------
#    0 |   0.94590 |   0.00000 |   0.00000
#    1 |   0.99881 |   0.00000 |   0.00000
========================================================================
#    1 |   0.99881 |   0.00000 |   0.00000
Playing best vs current to see who wins the title...
best won!
Epoch #4...
Self-play matches begin...
Memory size is 223
Enough to train!
Training on  223 patterns...
Evaluating initial training metrics...
Training...
|  Training |    policy |     value
Epochs |     Error |  head acc |  head acc
------ | --------- | --------- | ---------
#    0 |   0.94871 |   0.00000 |   0.00000
#    1 |   0.77693 |   0.00000 |   0.20628
========================================================================
#    1 |   0.77693 |   0.00000 |   0.20628
Playing best vs current to see who wins the title...
best won!
Epoch #5...
Self-play matches begin...
Memory size is 262
Enough to train!
Training on  262 patterns...
Evaluating initial training metrics...
Training...
|  Training |    policy |     value
Epochs |     Error |  head acc |  head acc
------ | --------- | --------- | ---------
#    0 |   0.94315 |   0.00000 |   0.00000
#    1 |   1.08274 |   0.00000 |   0.20992
========================================================================
#    1 |   1.08274 |   0.00000 |   0.20992
Playing best vs current to see who wins the title...
current won! swapping weights

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-3b68bf42b7da> in <module>()
22             # give the better weights to the best_player
23             best_player.net.set_weights(
---> 24                 current_player.get_weights())
25             ## clear memory here?
26         else:

AttributeError: 'AlphaZeroPlayer' object has no attribute 'get_weights'

In [40]:

len(game.memory)

Out[40]:

262

In [42]:

current_player.net.dataset.load(game.memory)

In [43]:

current_player.net["policy_head"].vshape = (6,7)
current_player.net.config["show_targets"] = True

In [45]:

current_player.net.dashboard()


3.13.6. Summary¶

• The network plays against itself, so it always faces an opponent at just the right level of difficulty: evolution-style self-improvement.
• It uses tree search (MCTS) during training, not just the raw network.