3.13. AlphaZero

This notebook is based on the paper:

with additional insight from:

This code uses conx, a new layer that sits on top of Keras. Conx is designed to be simpler and more intuitive than Keras, with integrated visualizations.

Currently this code requires the TensorFlow backend, as it includes a loss function written directly in TensorFlow.

3.13.1. The Game

First, let’s look at a specific game. We could use many, but for this demonstration we’ll pick ConnectFour. The aima3 package provides a game engine and a good collection of games, based on the code from Artificial Intelligence: A Modern Approach.

If you would like to install aima3, you can use something like this in a cell:

! pip install aima3 -U --user

Besides ConnectFour, aima3 has other games that you can play, including TicTacToe, and it wraps up many AI algorithms as players. You can see more details about the game engine and ConnectFour here:

and other resources in that repository.

We import some of these that will be useful in our AlphaZero exploration:

In [1]:
from aima3.games import (ConnectFour, RandomPlayer,
                         MCTSPlayer, QueryPlayer, Player,
                         MiniMaxPlayer, AlphaBetaPlayer,
                         AlphaBetaCutoffPlayer)
import numpy as np

Let’s make a game:

In [2]:
game = ConnectFour()

and play a game between two random players:

In [3]:
game.play_game(RandomPlayer("Random-1"), RandomPlayer("Random-2"))
Random-2 is thinking...
Random-2 makes action (1, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
X . . . . . .
Random-1 is thinking...
Random-1 makes action (3, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
X . O . . . .
Random-2 is thinking...
Random-2 makes action (5, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
X . O . X . .
Random-1 is thinking...
Random-1 makes action (4, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
X . O O X . .
Random-2 is thinking...
Random-2 makes action (3, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . X . . . .
X . O O X . .
Random-1 is thinking...
Random-1 makes action (7, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . X . . . .
X . O O X . O
Random-2 is thinking...
Random-2 makes action (7, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . X . . . X
X . O O X . O
Random-1 is thinking...
Random-1 makes action (5, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . X . O . X
X . O O X . O
Random-2 is thinking...
Random-2 makes action (4, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . X X O . X
X . O O X . O
Random-1 is thinking...
Random-1 makes action (1, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
O . X X O . X
X . O O X . O
Random-2 is thinking...
Random-2 makes action (4, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . X . . .
O . X X O . X
X . O O X . O
Random-1 is thinking...
Random-1 makes action (3, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . O X . . .
O . X X O . X
X . O O X . O
Random-2 is thinking...
Random-2 makes action (1, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
X . O X . . .
O . X X O . X
X . O O X . O
Random-1 is thinking...
Random-1 makes action (3, 4):
. . . . . . .
. . . . . . .
. . . . . . .
. . O . . . .
X . O X . . .
O . X X O . X
X . O O X . O
Random-2 is thinking...
Random-2 makes action (1, 4):
. . . . . . .
. . . . . . .
. . . . . . .
X . O . . . .
X . O X . . .
O . X X O . X
X . O O X . O
Random-1 is thinking...
Random-1 makes action (1, 5):
. . . . . . .
. . . . . . .
O . . . . . .
X . O . . . .
X . O X . . .
O . X X O . X
X . O O X . O
Random-2 is thinking...
Random-2 makes action (2, 1):
. . . . . . .
. . . . . . .
O . . . . . .
X . O . . . .
X . O X . . .
O . X X O . X
X X O O X . O
Random-1 is thinking...
Random-1 makes action (6, 1):
. . . . . . .
. . . . . . .
O . . . . . .
X . O . . . .
X . O X . . .
O . X X O . X
X X O O X O O
Random-2 is thinking...
Random-2 makes action (1, 6):
. . . . . . .
X . . . . . .
O . . . . . .
X . O . . . .
X . O X . . .
O . X X O . X
X X O O X O O
Random-1 is thinking...
Random-1 makes action (3, 5):
. . . . . . .
X . . . . . .
O . O . . . .
X . O . . . .
X . O X . . .
O . X X O . X
X X O O X O O
Random-2 is thinking...
Random-2 makes action (2, 2):
. . . . . . .
X . . . . . .
O . O . . . .
X . O . . . .
X . O X . . .
O X X X O . X
X X O O X O O
Random-1 is thinking...
Random-1 makes action (4, 4):
. . . . . . .
X . . . . . .
O . O . . . .
X . O O . . .
X . O X . . .
O X X X O . X
X X O O X O O
Random-2 is thinking...
Random-2 makes action (5, 3):
. . . . . . .
X . . . . . .
O . O . . . .
X . O O . . .
X . O X X . .
O X X X O . X
X X O O X O O
Random-1 is thinking...
Random-1 makes action (5, 4):
. . . . . . .
X . . . . . .
O . O . . . .
X . O O O . .
X . O X X . .
O X X X O . X
X X O O X O O
Random-2 is thinking...
Random-2 makes action (7, 3):
. . . . . . .
X . . . . . .
O . O . . . .
X . O O O . .
X . O X X . X
O X X X O . X
X X O O X O O
Random-1 is thinking...
Random-1 makes action (6, 2):
. . . . . . .
X . . . . . .
O . O . . . .
X . O O O . .
X . O X X . X
O X X X O O X
X X O O X O O
Random-2 is thinking...
Random-2 makes action (7, 4):
. . . . . . .
X . . . . . .
O . O . . . .
X . O O O . X
X . O X X . X
O X X X O O X
X X O O X O O
Random-1 is thinking...
Random-1 makes action (7, 5):
. . . . . . .
X . . . . . .
O . O . . . O
X . O O O . X
X . O X X . X
O X X X O O X
X X O O X O O
Random-2 is thinking...
Random-2 makes action (2, 3):
. . . . . . .
X . . . . . .
O . O . . . O
X . O O O . X
X X O X X . X
O X X X O O X
X X O O X O O
Random-1 is thinking...
Random-1 makes action (5, 5):
. . . . . . .
X . . . . . .
O . O . O . O
X . O O O . X
X X O X X . X
O X X X O O X
X X O O X O O
Random-2 is thinking...
Random-2 makes action (3, 6):
. . . . . . .
X . X . . . .
O . O . O . O
X . O O O . X
X X O X X . X
O X X X O O X
X X O O X O O
Random-1 is thinking...
Random-1 makes action (2, 4):
. . . . . . .
X . X . . . .
O . O . O . O
X O O O O . X
X X O X X . X
O X X X O O X
X X O O X O O
***** Random-1 wins!
Out[3]:
['Random-1']

We can also play a match (a series of games), or even a tournament among several players.

p1 = RandomPlayer("Random-1")
p2 = MiniMaxPlayer("MiniMax-1")
p3 = AlphaBetaCutoffPlayer("ABCutoff-1")

game.play_matches(10, p1, p2)

game.play_tournament(1, p1, p2, p3)

Can you beat RandomPlayer? We hope so!

Can you beat MiniMaxPlayer? No! But full minimax search takes far too long on a board this size.

Humans enter their commands by (column, row) where column starts at 1 from left, and row starts at 1 from bottom.

In [4]:
# game.play_game(AlphaBetaCutoffPlayer("AlphaBetaCutoff"), QueryPlayer("Your Name Here"))

3.13.2. The Network

Next, we are going to build the same kind of network described in the AlphaZero paper.

Make sure to set your Keras backend to TensorFlow for now, as we have a function that is written at that level.
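If your Keras installation is currently configured for a different backend, one way to switch (a minimal sketch; this assumes Keras 2.x, which reads the KERAS_BACKEND environment variable at import time) is to set the variable before keras or conx is first imported:

import os
os.environ["KERAS_BACKEND"] = "tensorflow"  # must be set before the first keras import

Alternatively, you can edit the "backend" entry in ~/.keras/keras.json.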

In [5]:
import conx as cx
from aima3.games import Game
from keras import regularizers
Using TensorFlow backend.
/usr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
  return f(*args, **kwds)
conx, version 3.5.14
In [6]:
## NEED TO REWRITE THIS FUNCTION IN KERAS:

import tensorflow as tf

def softmax_cross_entropy_with_logits(y_true, y_pred):
    p = y_pred    # raw logits from the (linear) policy head
    pi = y_true   # target move probabilities; zeros mark disallowed moves
    zero = tf.zeros(shape=tf.shape(pi), dtype=tf.float32)
    where = tf.equal(pi, zero)
    negatives = tf.fill(tf.shape(pi), -100.0)
    # push the logits of disallowed moves down to -100 so softmax gives them ~zero probability
    p = tf.where(where, negatives, p)
    loss = tf.nn.softmax_cross_entropy_with_logits(labels = pi, logits = p)
    return loss
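To see what this loss is doing, here is a rough NumPy equivalent for a single example (an illustrative sketch only; it is not used by the training code): wherever the target pi is zero the move is disallowed, so its logit is pushed down to -100 and gets essentially no probability mass, and the usual softmax cross-entropy is computed over the rest.

import numpy as np

def masked_softmax_cross_entropy(pi, logits):
    # pi: target probabilities (zeros mark disallowed moves)
    # logits: raw outputs of the linear policy head
    logits = np.where(pi == 0, -100.0, logits)       # mask disallowed moves
    logits = logits - np.max(logits)                 # numerical stability
    probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax
    return -np.sum(pi * np.log(probs))               # cross-entropy

pi = np.array([0.0, 0.7, 0.3, 0.0])
logits = np.array([2.0, 1.0, 0.5, -1.0])
print(masked_softmax_cross_entropy(pi, logits))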

3.13.2.1. Representations

The board state is the most important piece of information. How should we represent it? Possible ideas:

  • a vector of 42 values
  • a 6x7 matrix

We decided to represent the state of the board as two 6x7 matrices: one representing the current player’s pieces, and the other the opponent’s pieces.

We also need to represent actions. Possible ideas:

  • 7 outputs, each representing a column to drop a piece into
  • two outputs, one representing row, and the other column
  • a 6x7 matrix, each entry representing a position on the grid
  • 42 outputs, each representing a position on the grid

We decided to represent them as the final option: 42 outputs.
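Concretely, the 42 outputs are indexed row by row starting from the top-left of the board (the make_mappings function defined below builds this mapping in both directions). A minimal sketch of the correspondence, assuming columns x run 1..7 from the left and rows y run 1..6 from the bottom:

V, H = 6, 7  # board height (rows) and width (columns)

def move_to_index(x, y):
    # the top row (y = V) maps to indices 0..6, the bottom row (y = 1) to 35..41
    return (V - y) * H + (x - 1)

print(move_to_index(1, 6))  # 0: top-left corner
print(move_to_index(2, 1))  # 36: matches move2pos[(2, 1)] below
print(move_to_index(7, 1))  # 41: bottom-right corner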

The network architecture in AlphaZero is quite large and has repeating blocks of layers. To help construct the network, we define some helper functions:

In [7]:
def add_conv_block(net, input_layer):
    cname = net.add(cx.Conv2DLayer("conv2d-%d",
                    filters=75,
                    kernel_size=(4,4),
                    padding='same',
                    use_bias=False,
                    activation='linear',
                    kernel_regularizer=regularizers.l2(0.0001)))
    bname = net.add(cx.BatchNormalizationLayer("batch-norm-%d", axis=1))
    lname = net.add(cx.LeakyReLULayer("leaky-relu-%d"))
    net.connect(input_layer, cname)
    net.connect(cname, bname)
    net.connect(bname, lname)
    return lname

def add_residual_block(net, input_layer):
    prev_layer = add_conv_block(net, input_layer)
    cname = net.add(cx.Conv2DLayer("conv2d-%d",
        filters=75,
        kernel_size=(4,4),
        padding='same',
        use_bias=False,
        activation='linear',
        kernel_regularizer=regularizers.l2(0.0001)))
    bname = net.add(cx.BatchNormalizationLayer("batch-norm-%d", axis=1))
    aname = net.add(cx.AddLayer("add-%d"))
    lname = net.add(cx.LeakyReLULayer("leaky-relu-%d"))
    net.connect(prev_layer, cname)
    net.connect(cname, bname)
    net.connect(input_layer, aname)
    net.connect(bname, aname)
    net.connect(aname, lname)
    return lname

def add_value_block(net, input_layer):
    l1 = net.add(cx.Conv2DLayer("conv2d-%d",
        filters=1,
        kernel_size=(1,1),
        padding='same',
        use_bias=False,
        activation='linear',
        kernel_regularizer=regularizers.l2(0.0001)))
    l2 = net.add(cx.BatchNormalizationLayer("batch-norm-%d", axis=1))
    l3 = net.add(cx.LeakyReLULayer("leaky-relu-%d"))
    l4 = net.add(cx.FlattenLayer("flatten-%d"))
    l5 = net.add(cx.Layer("dense-%d",
        20,
        use_bias=False,
        activation='linear',
        kernel_regularizer=regularizers.l2(0.0001)))
    l6 = net.add(cx.LeakyReLULayer("leaky-relu-%d"))
    l7 = net.add(cx.Layer('value_head',
        1,
        use_bias=False,
        activation='tanh',
        kernel_regularizer=regularizers.l2(0.0001)))
    net.connect(input_layer, l1)
    net.connect(l1, l2)
    net.connect(l2, l3)
    net.connect(l3, l4)
    net.connect(l4, l5)
    net.connect(l5, l6)
    net.connect(l6, l7)
    return l7

def add_policy_block(net, input_layer):
    l1 = net.add(cx.Conv2DLayer("conv2d-%d",
        filters=2,
        kernel_size=(1,1),
        padding='same',
        use_bias=False,
        activation='linear',
        kernel_regularizer = regularizers.l2(0.0001)))
    l2 = net.add(cx.BatchNormalizationLayer("batch-norm-%d", axis=1))
    l3 = net.add(cx.LeakyReLULayer("leaky-relu-%d"))
    l4 = net.add(cx.FlattenLayer("flatten-%d"))
    l5 = net.add(cx.Layer('policy_head',
            42,
            use_bias=False,
            activation='linear',
            kernel_regularizer=regularizers.l2(0.0001)))
    net.connect(input_layer, l1)
    net.connect(l1, l2)
    net.connect(l2, l3)
    net.connect(l3, l4)
    net.connect(l4, l5)
    return l5
In [8]:
def make_network(game, residuals=5):
    net = cx.Network("AlphaZero Network")
    net.add(cx.Layer("main_input", (game.v, game.h, 2)))
    out_layer = add_conv_block(net, "main_input")
    for i in range(residuals):
        out_layer = add_residual_block(net, out_layer)
    add_policy_block(net, out_layer)
    add_value_block(net, out_layer)
    net.compile(loss={'value_head': 'mean_squared_error',
                  'policy_head': softmax_cross_entropy_with_logits},
            optimizer=cx.SGD(lr=0.1, momentum=0.9),
            loss_weights={'value_head': 0.5,
                          'policy_head': 0.5})
    for layer in net.layers:
        if layer.kind() == "hidden":
            layer.visible = False
    return net
In [9]:
game = ConnectFour()
net = make_network(game)
In [10]:
net.model.summary()
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
main_input (InputLayer)         (None, 6, 7, 2)      0
__________________________________________________________________________________________________
conv2d-1 (Conv2D)               (None, 6, 7, 75)     2400        main_input[0][0]
__________________________________________________________________________________________________
batch-norm-1 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-1[0][0]
__________________________________________________________________________________________________
leaky-relu-1 (LeakyReLU)        (None, 6, 7, 75)     0           batch-norm-1[0][0]
__________________________________________________________________________________________________
conv2d-2 (Conv2D)               (None, 6, 7, 75)     90000       leaky-relu-1[0][0]
__________________________________________________________________________________________________
batch-norm-2 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-2[0][0]
__________________________________________________________________________________________________
leaky-relu-2 (LeakyReLU)        (None, 6, 7, 75)     0           batch-norm-2[0][0]
__________________________________________________________________________________________________
conv2d-3 (Conv2D)               (None, 6, 7, 75)     90000       leaky-relu-2[0][0]
__________________________________________________________________________________________________
batch-norm-3 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-3[0][0]
__________________________________________________________________________________________________
add-1 (Add)                     (None, 6, 7, 75)     0           leaky-relu-1[0][0]
                                                                 batch-norm-3[0][0]
__________________________________________________________________________________________________
leaky-relu-3 (LeakyReLU)        (None, 6, 7, 75)     0           add-1[0][0]
__________________________________________________________________________________________________
conv2d-4 (Conv2D)               (None, 6, 7, 75)     90000       leaky-relu-3[0][0]
__________________________________________________________________________________________________
batch-norm-4 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-4[0][0]
__________________________________________________________________________________________________
leaky-relu-4 (LeakyReLU)        (None, 6, 7, 75)     0           batch-norm-4[0][0]
__________________________________________________________________________________________________
conv2d-5 (Conv2D)               (None, 6, 7, 75)     90000       leaky-relu-4[0][0]
__________________________________________________________________________________________________
batch-norm-5 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-5[0][0]
__________________________________________________________________________________________________
add-2 (Add)                     (None, 6, 7, 75)     0           leaky-relu-3[0][0]
                                                                 batch-norm-5[0][0]
__________________________________________________________________________________________________
leaky-relu-5 (LeakyReLU)        (None, 6, 7, 75)     0           add-2[0][0]
__________________________________________________________________________________________________
conv2d-6 (Conv2D)               (None, 6, 7, 75)     90000       leaky-relu-5[0][0]
__________________________________________________________________________________________________
batch-norm-6 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-6[0][0]
__________________________________________________________________________________________________
leaky-relu-6 (LeakyReLU)        (None, 6, 7, 75)     0           batch-norm-6[0][0]
__________________________________________________________________________________________________
conv2d-7 (Conv2D)               (None, 6, 7, 75)     90000       leaky-relu-6[0][0]
__________________________________________________________________________________________________
batch-norm-7 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-7[0][0]
__________________________________________________________________________________________________
add-3 (Add)                     (None, 6, 7, 75)     0           leaky-relu-5[0][0]
                                                                 batch-norm-7[0][0]
__________________________________________________________________________________________________
leaky-relu-7 (LeakyReLU)        (None, 6, 7, 75)     0           add-3[0][0]
__________________________________________________________________________________________________
conv2d-8 (Conv2D)               (None, 6, 7, 75)     90000       leaky-relu-7[0][0]
__________________________________________________________________________________________________
batch-norm-8 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-8[0][0]
__________________________________________________________________________________________________
leaky-relu-8 (LeakyReLU)        (None, 6, 7, 75)     0           batch-norm-8[0][0]
__________________________________________________________________________________________________
conv2d-9 (Conv2D)               (None, 6, 7, 75)     90000       leaky-relu-8[0][0]
__________________________________________________________________________________________________
batch-norm-9 (BatchNormalizatio (None, 6, 7, 75)     24          conv2d-9[0][0]
__________________________________________________________________________________________________
add-4 (Add)                     (None, 6, 7, 75)     0           leaky-relu-7[0][0]
                                                                 batch-norm-9[0][0]
__________________________________________________________________________________________________
leaky-relu-9 (LeakyReLU)        (None, 6, 7, 75)     0           add-4[0][0]
__________________________________________________________________________________________________
conv2d-10 (Conv2D)              (None, 6, 7, 75)     90000       leaky-relu-9[0][0]
__________________________________________________________________________________________________
batch-norm-10 (BatchNormalizati (None, 6, 7, 75)     24          conv2d-10[0][0]
__________________________________________________________________________________________________
leaky-relu-10 (LeakyReLU)       (None, 6, 7, 75)     0           batch-norm-10[0][0]
__________________________________________________________________________________________________
conv2d-11 (Conv2D)              (None, 6, 7, 75)     90000       leaky-relu-10[0][0]
__________________________________________________________________________________________________
batch-norm-11 (BatchNormalizati (None, 6, 7, 75)     24          conv2d-11[0][0]
__________________________________________________________________________________________________
add-5 (Add)                     (None, 6, 7, 75)     0           leaky-relu-9[0][0]
                                                                 batch-norm-11[0][0]
__________________________________________________________________________________________________
leaky-relu-11 (LeakyReLU)       (None, 6, 7, 75)     0           add-5[0][0]
__________________________________________________________________________________________________
conv2d-13 (Conv2D)              (None, 6, 7, 1)      75          leaky-relu-11[0][0]
__________________________________________________________________________________________________
batch-norm-13 (BatchNormalizati (None, 6, 7, 1)      24          conv2d-13[0][0]
__________________________________________________________________________________________________
conv2d-12 (Conv2D)              (None, 6, 7, 2)      150         leaky-relu-11[0][0]
__________________________________________________________________________________________________
leaky-relu-13 (LeakyReLU)       (None, 6, 7, 1)      0           batch-norm-13[0][0]
__________________________________________________________________________________________________
batch-norm-12 (BatchNormalizati (None, 6, 7, 2)      24          conv2d-12[0][0]
__________________________________________________________________________________________________
flatten-2 (Flatten)             (None, 42)           0           leaky-relu-13[0][0]
__________________________________________________________________________________________________
leaky-relu-12 (LeakyReLU)       (None, 6, 7, 2)      0           batch-norm-12[0][0]
__________________________________________________________________________________________________
dense-1 (Dense)                 (None, 20)           840         flatten-2[0][0]
__________________________________________________________________________________________________
flatten-1 (Flatten)             (None, 84)           0           leaky-relu-12[0][0]
__________________________________________________________________________________________________
leaky-relu-14 (LeakyReLU)       (None, 20)           0           dense-1[0][0]
__________________________________________________________________________________________________
policy_head (Dense)             (None, 42)           3528        flatten-1[0][0]
__________________________________________________________________________________________________
value_head (Dense)              (None, 1)            20          leaky-relu-14[0][0]
==================================================================================================
Total params: 907,325
Trainable params: 907,169
Non-trainable params: 156
__________________________________________________________________________________________________
In [11]:
len(net.layers)
Out[11]:
51
In [12]:
net.render()
Out[12]:
[Rendered network diagram: main_input (6, 7, 2) feeding through the convolutional and residual blocks into the two output layers, policy_head (42, linear) and value_head (1, tanh).]

3.13.3. Connecting the Network to the Game

First, we need a mapping from game (x,y) moves to a position in a list of actions and probabilities.

In [13]:
def make_mappings(game):
    """
    Get a mapping from game's (x,y) to array position.
    """
    move2pos = {}
    pos2move = []
    position = 0
    for y in range(game.v, 0, -1):
        for x in range(1, game.h + 1):
            move2pos[(x,y)] = position
            pos2move.append((x,y))
            position += 1
    return move2pos, pos2move

We use the ConnectFour game defined above:

In [14]:
move2pos, pos2move = make_mappings(game)
In [15]:
move2pos[(2,1)]
Out[15]:
36
In [16]:
pos2move[35]
Out[16]:
(1, 1)

We need a method for converting a state’s board into an array:

In [17]:
def state2array(game, state):
    array = []
    to_move = game.to_move(state)
    for y in range(game.v, 0, -1):
        for x in range(1, game.h + 1):
            item = state.board.get((x, y), 0)
            if item != 0:
                item = 1 if item == to_move else -1
            array.append(item)
    return array
In [18]:
cx.shape(state2array(game, game.initial))
Out[18]:
(42,)

So, state2array returns a list of 42 numbers, where:

  • 0 represents an empty place
  • 1 represents one of my pieces
  • -1 represents one of my opponent’s pieces

Note that “my” and “my opponent” swap back and forth depending on perspective (i.e., whose turn it is, as determined by game.to_move(state)).
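A quick check of this perspective flip, using the game, state2array, and move2pos defined above: after X plays (1,1) it is O’s turn, so from the new perspective that piece belongs to the opponent and shows up as -1.

s0 = game.initial
s1 = game.result(s0, (1, 1))    # X plays (1,1); now it is O's move
arr = state2array(game, s1)
print(arr[move2pos[(1, 1)]])    # -1: X's piece, seen from O's perspective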

In [19]:
def state2inputs(game, state):
    board = np.array(state2array(game, state)) # 1 is my pieces, -1 other
    currentplayer_position = np.zeros(len(board), dtype=np.int)
    currentplayer_position[board==1] = 1
    other_position = np.zeros(len(board), dtype=np.int)
    other_position[board==-1] = 1
    position = np.array(list(zip(currentplayer_position,other_position)))
    inputs = position.reshape((game.v, game.h, 2))
    return inputs.tolist()

The state2inputs function above converts the state’s board into the input form the neural network expects:

In [20]:
state2inputs(game, game.initial)
Out[20]:
[[[0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0]],
 [[0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0]],
 [[0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0]],
 [[0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0]],
 [[0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0]],
 [[0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0]]]

We can check to see if this is correct by propagating the activations to the first layer.

Initial board state has no pieces on the board:

In [21]:
state = game.initial
net.propagate_to_features("main_input", state2inputs(game, state))
Out[21]:

Feature 0

Feature 1

Now we make a move to (1,1). But note that after the move, it is now the other player’s move. So the first move is seen on the opponent’s board (the right side, feature #1):

In [22]:
state = game.result(game.initial, (1,1))
net.propagate_to_features("main_input", state2inputs(game, state))
Out[22]:

Feature 0

Feature 1

Now the second player moves to (3,1). We are back to the first player’s perspective, so the piece that previously appeared in the right-hand plane now shows up in the left-hand plane, because the planes always show the board from the current player’s point of view.

In [23]:
state = game.result(state, (3,1))
net.propagate_to_features("main_input", state2inputs(game, state))
Out[23]:

Feature 0

Feature 1

Finally, we are ready to connect the game to the network. We define a function get_predictions that takes the network, game, and state, propagates the state through the network, and returns (value, probabilities, allowedActions). The probabilities are the pi list from the AlphaZero paper.

In [24]:
def get_predictions(net, game, state):
    """
    Given a state, give output of network on preferred
    actions. state.allowedActions removes impossible
    actions.

    Returns (value, probabilities, allowedActions)
    """
    board = np.array(state2array(game, state)) # 1 is my pieces, -1 other
    inputs = state2inputs(game, state)
    preds = net.propagate(inputs, visualize=True)
    value = preds[1][0]
    logits = np.array(preds[0])
    allowedActions = np.array([move2pos[act] for act in game.actions(state)])
    mask = np.ones(len(board), dtype=bool)
    mask[allowedActions] = False
    logits[mask] = -100
    # softmax over the masked logits:
    odds = np.exp(logits)
    probs = odds / np.sum(odds)
    return (value, probs.tolist(), allowedActions.tolist())
In [25]:
value, probs, acts = get_predictions(net, game, state)
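A couple of quick sanity checks on these outputs (illustrative only): the value is a single number in [-1, 1] from the tanh value head, the 42 probabilities sum to one, and after masking essentially all of the probability mass sits on the allowed actions.

print(value)                         # a scalar in [-1, 1]
print(round(sum(probs), 6))          # ~1.0
print(len(acts), "allowed actions")  # all 7 columns are still open in this position
print(sum(probs[i] for i in acts))   # ~1.0: mass concentrated on allowed moves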
In [26]:
net.snapshot(state2inputs(game, state))
Out[26]:
[Network snapshot showing the activations for the current state, from main_input (6, 7, 2) through to policy_head (42) and value_head (1).]

3.13.4. Testing Game and Network Integration

Finally, we turn the predictions into a move, and we can play a game with the network.

In [27]:
class NNPlayer(Player):

    def set_game(self, game):
        """
        Get a mapping from game's (x,y) to array position.
        """
        self.net = make_network(game)
        self.game = game
        self.move2pos = {}
        self.pos2move = []
        position = 0
        for y in range(self.game.v, 0, -1):
            for x in range(1, self.game.h + 1):
                self.move2pos[(x,y)] = position
                self.pos2move.append((x,y))
                position += 1

    def get_predictions(self, state):
        """
        Given a state, give output of network on preferred
        actions. state.allowedActions removes impossible
        actions.

        Returns (value, probabilities, allowedActions)
        """
        board = np.array(self.state2array(state)) # 1 is my pieces, -1 other
        inputs = self.state2inputs(state)
        preds = self.net.propagate(inputs)
        value = preds[1][0]
        logits = np.array(preds[0])
        allowedActions = np.array([self.move2pos[act] for act in self.game.actions(state)])
        mask = np.ones(len(board), dtype=bool)
        mask[allowedActions] = False
        logits[mask] = -100
        #SOFTMAX
        odds = np.exp(logits)
        probs = odds / np.sum(odds)
        return (value, probs.tolist(), allowedActions.tolist())

    def get_action(self, state, turn):
        value, probabilities, moves = self.get_predictions(state)
        probs = np.array(probabilities)[moves]
        pos = cx.choice(moves, probs)
        return self.pos2move[pos]

    def state2inputs(self, state):
        board = np.array(self.state2array(state)) # 1 is my pieces, -1 other
        currentplayer_position = np.zeros(len(board), dtype=np.int)
        currentplayer_position[board==1] = 1
        other_position = np.zeros(len(board), dtype=np.int)
        other_position[board==-1] = 1
        position = np.array(list(zip(currentplayer_position,other_position)))
        inputs = position.reshape((self.game.v, self.game.h, 2))
        return inputs

    def state2array(self, state):
        array = []
        to_move = self.game.to_move(state)
        for y in range(self.game.v, 0, -1):
            for x in range(1, self.game.h + 1):
                item = state.board.get((x, y), 0)
                if item != 0:
                    item = 1 if item == to_move else -1
                array.append(item)
        return array
In [28]:
p1 = RandomPlayer("Random")
p2 = NNPlayer("NNPlayer")
In [29]:
p2.set_game(game)
In [30]:
p2.get_action(state, 2)
Out[30]:
(2, 1)
In [31]:
game.play_game(p1, p2)
NNPlayer is thinking...
NNPlayer makes action (2, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. X . . . . .
Random is thinking...
Random makes action (4, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. X . O . . .
NNPlayer is thinking...
NNPlayer makes action (5, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. X . O X . .
Random is thinking...
Random makes action (4, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . O . . .
. X . O X . .
NNPlayer is thinking...
NNPlayer makes action (5, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . O X . .
. X . O X . .
Random is thinking...
Random makes action (5, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. . . O X . .
. X . O X . .
NNPlayer is thinking...
NNPlayer makes action (3, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. . . O X . .
. X X O X . .
Random is thinking...
Random makes action (2, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. O . O X . .
. X X O X . .
NNPlayer is thinking...
NNPlayer makes action (3, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. O X O X . .
. X X O X . .
Random is thinking...
Random makes action (7, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . O . .
. O X O X . .
. X X O X . O
NNPlayer is thinking...
NNPlayer makes action (3, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . X . O . .
. O X O X . .
. X X O X . O
Random is thinking...
Random makes action (6, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . X . O . .
. O X O X . .
. X X O X O O
NNPlayer is thinking...
NNPlayer makes action (1, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . X . O . .
. O X O X . .
X X X O X O O
Random is thinking...
Random makes action (7, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . X . O . .
. O X O X . O
X X X O X O O
NNPlayer is thinking...
NNPlayer makes action (2, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. X X . O . .
. O X O X . O
X X X O X O O
Random is thinking...
Random makes action (3, 4):
. . . . . . .
. . . . . . .
. . . . . . .
. . O . . . .
. X X . O . .
. O X O X . O
X X X O X O O
NNPlayer is thinking...
NNPlayer makes action (1, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . O . . . .
. X X . O . .
X O X O X . O
X X X O X O O
Random is thinking...
Random makes action (5, 4):
. . . . . . .
. . . . . . .
. . . . . . .
. . O . O . .
. X X . O . .
X O X O X . O
X X X O X O O
NNPlayer is thinking...
NNPlayer makes action (3, 5):
. . . . . . .
. . . . . . .
. . X . . . .
. . O . O . .
. X X . O . .
X O X O X . O
X X X O X O O
Random is thinking...
Random makes action (5, 5):
. . . . . . .
. . . . . . .
. . X . O . .
. . O . O . .
. X X . O . .
X O X O X . O
X X X O X O O
NNPlayer is thinking...
NNPlayer makes action (1, 3):
. . . . . . .
. . . . . . .
. . X . O . .
. . O . O . .
X X X . O . .
X O X O X . O
X X X O X O O
Random is thinking...
Random makes action (3, 6):
. . . . . . .
. . O . . . .
. . X . O . .
. . O . O . .
X X X . O . .
X O X O X . O
X X X O X O O
NNPlayer is thinking...
NNPlayer makes action (5, 6):
. . . . . . .
. . O . X . .
. . X . O . .
. . O . O . .
X X X . O . .
X O X O X . O
X X X O X O O
Random is thinking...
Random makes action (6, 2):
. . . . . . .
. . O . X . .
. . X . O . .
. . O . O . .
X X X . O . .
X O X O X O O
X X X O X O O
NNPlayer is thinking...
NNPlayer makes action (2, 4):
. . . . . . .
. . O . X . .
. . X . O . .
. X O . O . .
X X X . O . .
X O X O X O O
X X X O X O O
Random is thinking...
Random makes action (4, 3):
. . . . . . .
. . O . X . .
. . X . O . .
. X O . O . .
X X X O O . .
X O X O X O O
X X X O X O O
NNPlayer is thinking...
NNPlayer makes action (1, 4):
. . . . . . .
. . O . X . .
. . X . O . .
X X O . O . .
X X X O O . .
X O X O X O O
X X X O X O O
***** NNPlayer wins!
Out[31]:
['NNPlayer']

3.13.5. Training The Network

Now we are ready to train the network. The training is a clever use of Monte Carlo Tree Search (MCTS), combined with self-play.

There is a Monte Carlo Tree Search player in aima3 that we will use. We set its policy to come from the neural network’s predictions.

In [32]:
class AlphaZeroMCTSPlayer(MCTSPlayer):
    """
    A Monte Carlo Tree Search with policy function from
    neural network. Network will be set later to self.nnplayer.
    """
    def policy(self, game, state):
        # these moves are positions:
        value, probs_all, moves = self.nnplayer.get_predictions(state)
        if len(moves) == 0:
            result = [], value
        else:
            probs = np.array(probs_all)[moves]
            moves = [self.nnplayer.pos2move[pos] for pos in moves]
            # we need to return probs and moves for game
            result = [(act, prob) for (act, prob) in list(zip(moves, probs))], value
        return result

The main AlphaZeroPlayer needs to be able to play in one of two modes:

  • self_play: it plays against itself (using two different MCTS instances, as this version requires it). The network provides the policy evaluation for each state as the search looks ahead.
  • regular play: moves come directly from the network
In [33]:
class AlphaZeroPlayer(NNPlayer):
    ## Load weights if continuing
    def __init__(self, name, n_playout=40, *args, **kwargs):
        super().__init__(name, *args, **kwargs)
        self.mcts_players = [AlphaZeroMCTSPlayer("MCTS-1", n_playout=n_playout),
                             AlphaZeroMCTSPlayer("MCTS-2", n_playout=n_playout)]

    def set_game(self, game):
        super().set_game(game)
        self.mcts_players[0].set_game(game)
        self.mcts_players[1].set_game(game)
        self.mcts_players[0].nnplayer = self
        self.mcts_players[1].nnplayer = self
        self.data = [[], []]
        self.cache = {}

    def get_action(self, state, turn, self_play):
        if self_play:
            ## Only way to determine which is which?
            if turn in self.cache:
                player_num = 1
            else:
                player_num = 0
                self.cache[turn] = True
            ## now use the policy to get some probs:
            move, pi = self.mcts_players[player_num].get_action(state, round(turn), return_prob=True)
            ## save the state and probs:
            self.data[player_num].append((self.state2inputs(state), self.move_probs2all_probs(pi)))
            return move
        else:
            # play the network, we're in the playoffs!
            return super().get_action(state, round(turn))

    def move_probs2all_probs(self, move_probs):
        all_probs = np.zeros(len(self.state2array(game.initial)))
        for move in move_probs:
            all_probs[self.move2pos[move]] = move_probs[move]
        return all_probs.tolist()

We now set up the game to play in one of the two modes.

One complication during self-play: the same player object plays both sides, and it needs to know which side it is currently playing so that the two sides’ data stay separate. To keep track, we cache the turn number; if we see the same turn a second time, we know it is the second player’s move.

In [34]:
class AlphaZeroGame(ConnectFour):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.memory = []

    def play_game(self, *players, flip_coin=False, verbose=1, **kwargs):
        results = super().play_game(*players, flip_coin=flip_coin, verbose=verbose, **kwargs)
        if "self_play" in kwargs and kwargs["self_play"]:
            ## Do not allow flipping coins when self play:
            ## Assumes that player1 == player2 when self-playing
            assert flip_coin is False, "no coin_flip when self-playing"
            ## value is in terms of player 0
            value = self.final_utility
            for state, probs in players[0].data[0]:
                self.memory.append([state, [probs, [value]]])
            # also data from opponent, so flip value:
            value = -value
            for state, probs in players[1].data[1]:
                self.memory.append([state, [probs, [value]]])
        return results
In [35]:
game = AlphaZeroGame()
best_player = AlphaZeroPlayer("best_player")
current_player = AlphaZeroPlayer("current_player")

Some basic tests to make sure things are going in the right place:

In [36]:
current_player.set_game(game)
assert current_player.data == [[], []]
print(current_player.get_action(game.initial, 1, self_play=False))
assert current_player.data == [[], []]
print(current_player.get_action(game.initial, 1, self_play=True))
assert current_player.data[0] != []
print(current_player.get_action(game.initial, 1, self_play=True))
assert current_player.data[1] != []
(5, 1)
(4, 1)
(3, 1)

A sample tournament, just for testing:

In [37]:
game.play_tournament(1, best_player, best_player, verbose=1, mode="ordered", self_play=True)
Tournament to begin with 2 matches...
best_player is thinking...
best_player makes action (3, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . X . . . .
best_player is thinking...
best_player makes action (7, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . X . . . O
best_player is thinking...
best_player makes action (2, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. X X . . . O
best_player is thinking...
best_player makes action (6, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. X X . . O O
best_player is thinking...
best_player makes action (1, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
X X X . . O O
best_player is thinking...
best_player makes action (4, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
X X X O . O O
best_player is thinking...
best_player makes action (5, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
X X X O X O O
best_player is thinking...
best_player makes action (4, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . O . . .
X X X O X O O
best_player is thinking...
best_player makes action (3, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . X O . . .
X X X O X O O
best_player is thinking...
best_player makes action (4, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . O . . .
. . X O . . .
X X X O X O O
best_player is thinking...
best_player makes action (7, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . O . . .
. . X O . . X
X X X O X O O
best_player is thinking...
best_player makes action (4, 4):
. . . . . . .
. . . . . . .
. . . . . . .
. . . O . . .
. . . O . . .
. . X O . . X
X X X O X O O
***** best_player wins!
best_player is thinking...
best_player makes action (7, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . X
best_player is thinking...
best_player makes action (2, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. O . . . . X
best_player is thinking...
best_player makes action (4, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. O . X . . X
best_player is thinking...
best_player makes action (2, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. O . . . . .
. O . X . . X
best_player is thinking...
best_player makes action (1, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. O . . . . .
X O . X . . X
best_player is thinking...
best_player makes action (7, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. O . . . . O
X O . X . . X
best_player is thinking...
best_player makes action (4, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. O . X . . O
X O . X . . X
best_player is thinking...
best_player makes action (3, 1):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. O . X . . O
X O O X . . X
best_player is thinking...
best_player makes action (1, 2):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
X O . X . . O
X O O X . . X
best_player is thinking...
best_player makes action (2, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. O . . . . .
X O . X . . O
X O O X . . X
best_player is thinking...
best_player makes action (1, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
X O . . . . .
X O . X . . O
X O O X . . X
best_player is thinking...
best_player makes action (4, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
X O . O . . .
X O . X . . O
X O O X . . X
best_player is thinking...
best_player makes action (7, 3):
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
X O . O . . X
X O . X . . O
X O O X . . X
best_player is thinking...
best_player makes action (2, 4):
. . . . . . .
. . . . . . .
. . . . . . .
. O . . . . .
X O . O . . X
X O . X . . O
X O O X . . X
***** best_player wins!
Out[37]:
{'DRAW': 0, 'best_player': 2}

Did we collect some history?

In [38]:
len(game.memory)
Out[38]:
26
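Each memory entry pairs one network input with its two training targets, in exactly the form conx’s dataset.load expects for this network. A quick look at the first entry (illustrative; the exact values depend on the games just played):

inputs, targets = game.memory[0]
print(np.array(inputs).shape)  # (6, 7, 2): board planes for main_input
print(len(targets[0]))         # 42: policy_head target (the MCTS move probabilities)
print(targets[1])              # [value]: value_head target, the game outcome for that player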

Ok, we are ready to learn!

In [49]:
config = dict(
    MINIMUM_MEMORY_SIZE_BEFORE_TRAINING = 1000, # min size of memory
    TRAINING_EPOCHS_PER_CYCLE = 500, # training on current network
    CYCLES = 1, # number of cycles to run
    SELF_PLAY_MATCHES = 1, # self-play matches per round
    TOURNAMENT_MATCHES = 2, # each match plays each player as first mover, so games = matches * 2
    BEST_SWAP_PERCENT = 1.0, # current must win more than best * this factor to take over
)
In [51]:
def alphazero_train(config):
    ## Uses global game, best_player, and current_player
    for cycle in range(config["CYCLES"]):
        print("Epoch #%s..." % cycle)
        # self-play, collect data:
        print("Self-play matches begin...")
        while len(game.memory) < config["MINIMUM_MEMORY_SIZE_BEFORE_TRAINING"]:
            results = game.play_tournament(config["SELF_PLAY_MATCHES"],
                                           best_player, best_player,
                                           mode="ordered", self_play=True)
            print("Memory size is %s" % len(game.memory))
        print("Enough to train!")
        current_player.net.dataset.clear()
        current_player.net.dataset.load(game.memory)
        print("Training on ", len(current_player.net.dataset.inputs), "patterns...")
        current_player.net.train(config["TRAINING_EPOCHS_PER_CYCLE"],
                                 batch_size=len(game.memory),
                                 plot=True)
        ## save dataset every once in a while
        ## now see which net is better:
        print("Playing best vs current to see who wins the title...")
        results = game.play_tournament(config["TOURNAMENT_MATCHES"],
                                       best_player, current_player,
                                       mode="one-each", self_play=False)
        if results["current_player"] > results["best_player"] * config["BEST_SWAP_PERCENT"]:
            print("current won! swapping weights")
            # give the better weights to the best_player
            best_player.net.set_weights(
                current_player.net.get_weights())
            game.memory = []
        else:
            print("best won!")
In [ ]:
alphazero_train(config)
Epoch #0...
Self-play matches begin...
Memory size is 171
Memory size is 208
Memory size is 247
Memory size is 284
Memory size is 310
Memory size is 349
Memory size is 387
Memory size is 425
Memory size is 455
Memory size is 481
Memory size is 541
Memory size is 571
Memory size is 608
Memory size is 657
Memory size is 703
Memory size is 744
Memory size is 788
Memory size is 833
Memory size is 878
Memory size is 920
Memory size is 979
Memory size is 1028
Enough to train!
Training on  1028 patterns...
Training...
       |  Training |    policy |     value
Epochs |     Error |  head acc |  head acc
------ | --------- | --------- | ---------
#  801 |   0.39601 |   0.00000 |   0.92708
#  802 |   1.03875 |   0.00000 |   0.27140
#  803 |   0.87153 |   0.00000 |   0.15953
#  804 |   0.79308 |   0.00000 |   0.18191
#  805 |   0.73491 |   0.00000 |   0.29475
#  806 |   0.73332 |   0.00000 |   0.35019
#  807 |   0.70390 |   0.00000 |   0.36868
#  808 |   0.67224 |   0.00000 |   0.40467
#  809 |   0.64956 |   0.00000 |   0.46790
#  810 |   0.65823 |   0.00000 |   0.49222
#  811 |   0.64335 |   0.00000 |   0.49611
#  812 |   0.68785 |   0.00000 |   0.47374
#  813 |   0.64451 |   0.00000 |   0.50486
#  814 |   0.62052 |   0.00000 |   0.51946
#  815 |   0.65647 |   0.00000 |   0.47179
#  816 |   0.71668 |   0.00000 |   0.45039
#  817 |   0.69026 |   0.00000 |   0.45525
#  818 |   0.62992 |   0.00000 |   0.53794
#  819 |   0.59203 |   0.00000 |   0.61479
#  820 |   0.62125 |   0.00000 |   0.60019
#  821 |   0.59042 |   0.00000 |   0.62257
#  822 |   0.56548 |   0.00000 |   0.64008
#  823 |   0.57956 |   0.00000 |   0.68288
#  824 |   0.60517 |   0.00000 |   0.60798
#  825 |   0.56350 |   0.00000 |   0.65953
#  826 |   0.54747 |   0.00000 |   0.71304
#  827 |   0.54154 |   0.00000 |   0.73054
#  828 |   0.57241 |   0.00000 |   0.69553
#  829 |   0.56825 |   0.00000 |   0.70914
#  830 |   0.55524 |   0.00000 |   0.69747
#  831 |   0.54866 |   0.00000 |   0.71109
#  832 |   0.54924 |   0.00000 |   0.71693
#  833 |   0.54617 |   0.00000 |   0.73249
#  834 |   0.53730 |   0.00000 |   0.73638
#  835 |   0.53897 |   0.00000 |   0.74125
#  836 |   0.54266 |   0.00000 |   0.72860
#  837 |   0.52672 |   0.00000 |   0.75486
#  838 |   0.53487 |   0.00000 |   0.77626
#  839 |   0.53721 |   0.00000 |   0.73054
#  840 |   0.56242 |   0.00000 |   0.70623
#  841 |   0.54416 |   0.00000 |   0.71693
#  842 |   0.53821 |   0.00000 |   0.73054
#  843 |   0.52338 |   0.00000 |   0.76654
#  844 |   0.52326 |   0.00000 |   0.77335
#  845 |   0.53004 |   0.00000 |   0.75973
#  846 |   0.52644 |   0.00000 |   0.77918
#  847 |   0.52099 |   0.00000 |   0.78113
#  848 |   0.52136 |   0.00000 |   0.78599
#  849 |   0.51461 |   0.00000 |   0.79961
#  850 |   0.52613 |   0.00000 |   0.75389
#  851 |   0.52171 |   0.00000 |   0.78210
#  852 |   0.52806 |   0.00000 |   0.76556
#  853 |   0.52643 |   0.00000 |   0.78502
#  854 |   0.52034 |   0.00000 |   0.78599
#  855 |   0.51269 |   0.00000 |   0.79183
#  856 |   0.51479 |   0.00000 |   0.78210
#  857 |   0.50916 |   0.00000 |   0.79572
#  858 |   0.50813 |   0.00000 |   0.79475
#  859 |   0.50480 |   0.00000 |   0.81615
#  860 |   0.50278 |   0.00000 |   0.82879
#  861 |   0.49981 |   0.00000 |   0.82296
#  862 |   0.51360 |   0.00000 |   0.80447
#  863 |   0.50122 |   0.00000 |   0.79377
#  864 |   0.49991 |   0.00000 |   0.81712
#  865 |   0.49977 |   0.00000 |   0.82101
#  866 |   0.49591 |   0.00000 |   0.81323
#  867 |   0.49709 |   0.00000 |   0.83074
#  868 |   0.49621 |   0.00000 |   0.82685
#  869 |   0.49390 |   0.00000 |   0.81809
#  870 |   0.49332 |   0.00000 |   0.82685
#  871 |   0.49415 |   0.00000 |   0.82588
#  872 |   0.49587 |   0.00000 |   0.82198
#  873 |   0.50590 |   0.00000 |   0.76167
#  874 |   0.50653 |   0.00000 |   0.79280
#  875 |   0.50713 |   0.00000 |   0.79669
#  876 |   0.50457 |   0.00000 |   0.78210
#  877 |   0.50043 |   0.00000 |   0.80447
#  878 |   0.51643 |   0.00000 |   0.74805
#  879 |   0.51950 |   0.00000 |   0.74903
#  880 |   0.51771 |   0.00000 |   0.77140
#  881 |   0.49895 |   0.00000 |   0.77140
#  882 |   0.49755 |   0.00000 |   0.81420
#  883 |   0.49531 |   0.00000 |   0.79669
#  884 |   0.51883 |   0.00000 |   0.77335
#  885 |   0.51188 |   0.00000 |   0.76265
#  886 |   0.50649 |   0.00000 |   0.78307
#  887 |   0.50355 |   0.00000 |   0.78405
#  888 |   0.49438 |   0.00000 |   0.79864
#  889 |   0.49338 |   0.00000 |   0.82101
#  890 |   0.49197 |   0.00000 |   0.82393
#  891 |   0.48883 |   0.00000 |   0.82490
#  892 |   0.49004 |   0.00000 |   0.83658
#  893 |   0.48731 |   0.00000 |   0.82198
#  894 |   0.48745 |   0.00000 |   0.83560
#  895 |   0.49023 |   0.00000 |   0.81907
#  896 |   0.48879 |   0.00000 |   0.82393
#  897 |   0.48995 |   0.00000 |   0.82198
#  898 |   0.48738 |   0.00000 |   0.81809
#  899 |   0.48518 |   0.00000 |   0.81615
#  900 |   0.48521 |   0.00000 |   0.83852
#  901 |   0.48788 |   0.00000 |   0.82101
#  902 |   0.48487 |   0.00000 |   0.83268
#  903 |   0.48504 |   0.00000 |   0.82296
#  904 |   0.48348 |   0.00000 |   0.82782
#  905 |   0.48291 |   0.00000 |   0.83074
#  906 |   0.48198 |   0.00000 |   0.83074
#  907 |   0.48150 |   0.00000 |   0.83560
#  908 |   0.47950 |   0.00000 |   0.83463
#  909 |   0.47981 |   0.00000 |   0.83852
#  910 |   0.48167 |   0.00000 |   0.84047
#  911 |   0.48813 |   0.00000 |   0.75486
#  912 |   0.48913 |   0.00000 |   0.82101
#  913 |   0.48447 |   0.00000 |   0.80058
#  914 |   0.48232 |   0.00000 |   0.83658
#  915 |   0.47927 |   0.00000 |   0.83658
#  916 |   0.47946 |   0.00000 |   0.83463
#  917 |   0.48028 |   0.00000 |   0.83463
#  918 |   0.47864 |   0.00000 |   0.82490
#  919 |   0.47678 |   0.00000 |   0.83949
#  920 |   0.47827 |   0.00000 |   0.83755
#  921 |   0.47766 |   0.00000 |   0.84047
#  922 |   0.47743 |   0.00000 |   0.82296
#  923 |   0.47477 |   0.00000 |   0.83755
#  924 |   0.47580 |   0.00000 |   0.83755
#  925 |   0.47772 |   0.00000 |   0.84241
#  926 |   0.47633 |   0.00000 |   0.82879
#  927 |   0.47586 |   0.00000 |   0.83658
#  928 |   0.47410 |   0.00000 |   0.84922
#  929 |   0.47460 |   0.00000 |   0.83755
#  930 |   0.47401 |   0.00000 |   0.83463
#  931 |   0.47358 |   0.00000 |   0.83463
#  932 |   0.48433 |   0.00000 |   0.77529
#  933 |   0.48837 |   0.00000 |   0.80156
#  934 |   0.49228 |   0.00000 |   0.75000
#  935 |   0.53406 |   0.00000 |   0.72957
#  936 |   0.59349 |   0.00000 |   0.65370
#  937 |   0.54461 |   0.00000 |   0.65661
#  938 |   0.51751 |   0.00000 |   0.73541
#  939 |   0.51946 |   0.00000 |   0.76654
#  940 |   0.51198 |   0.00000 |   0.74611
#  941 |   0.49071 |   0.00000 |   0.79572
#  942 |   0.48741 |   0.00000 |   0.80837
#  943 |   0.48940 |   0.00000 |   0.83171
#  944 |   0.49434 |   0.00000 |   0.78307
#  945 |   0.50810 |   0.00000 |   0.79086
#  946 |   0.50190 |   0.00000 |   0.77529
#  947 |   0.50138 |   0.00000 |   0.79280
#  948 |   0.49911 |   0.00000 |   0.79572
#  949 |   0.50314 |   0.00000 |   0.80739
#  950 |   0.49738 |   0.00000 |   0.80058
#  951 |   0.53510 |   0.00000 |   0.74222
#  952 |   0.53245 |   0.00000 |   0.75973
#  953 |   0.50113 |   0.00000 |   0.79280
#  954 |   0.49678 |   0.00000 |   0.80642
#  955 |   0.49968 |   0.00000 |   0.82004
#  956 |   0.49353 |   0.00000 |   0.79864
#  957 |   0.49409 |   0.00000 |   0.82393
#  958 |   0.48850 |   0.00000 |   0.81712
#  959 |   0.48732 |   0.00000 |   0.82198
#  960 |   0.48583 |   0.00000 |   0.83463
#  961 |   0.48717 |   0.00000 |   0.83560
#  962 |   0.49166 |   0.00000 |   0.78988
#  963 |   0.48718 |   0.00000 |   0.82101
#  964 |   0.48449 |   0.00000 |   0.84436
#  965 |   0.48572 |   0.00000 |   0.84241
#  966 |   0.48336 |   0.00000 |   0.83366
#  967 |   0.48527 |   0.00000 |   0.82198
#  968 |   0.48378 |   0.00000 |   0.83658
#  969 |   0.48560 |   0.00000 |   0.83463
#  970 |   0.48140 |   0.00000 |   0.83560
#  971 |   0.47973 |   0.00000 |   0.84047
#  972 |   0.48026 |   0.00000 |   0.84339
#  973 |   0.52743 |   0.00000 |   0.71109
#  974 |   0.51828 |   0.00000 |   0.73152
#  975 |   0.51169 |   0.00000 |   0.75681
#  976 |   0.50040 |   0.00000 |   0.77626
#  977 |   0.49987 |   0.00000 |   0.80545
#  978 |   0.49475 |   0.00000 |   0.79961
#  979 |   0.48799 |   0.00000 |   0.82685
#  980 |   0.48720 |   0.00000 |   0.83171
#  981 |   0.48790 |   0.00000 |   0.84241
#  982 |   0.48720 |   0.00000 |   0.82296
#  983 |   0.48248 |   0.00000 |   0.83366
#  984 |   0.48260 |   0.00000 |   0.85019
#  985 |   0.48272 |   0.00000 |   0.83560
#  986 |   0.48424 |   0.00000 |   0.82101
#  987 |   0.48143 |   0.00000 |   0.84047
#  988 |   0.48048 |   0.00000 |   0.82977
#  989 |   0.47887 |   0.00000 |   0.84922
#  990 |   0.48022 |   0.00000 |   0.84241
#  991 |   0.47867 |   0.00000 |   0.84533
#  992 |   0.47878 |   0.00000 |   0.84241
#  993 |   0.47876 |   0.00000 |   0.83560
#  994 |   0.47767 |   0.00000 |   0.84144
#  995 |   0.47786 |   0.00000 |   0.85214
#  996 |   0.47649 |   0.00000 |   0.84533
#  997 |   0.47572 |   0.00000 |   0.84728
#  998 |   0.47814 |   0.00000 |   0.83560
#  999 |   0.47693 |   0.00000 |   0.82588
# 1000 |   0.47429 |   0.00000 |   0.82977
# 1001 |   0.47584 |   0.00000 |   0.85798
# 1002 |   0.47488 |   0.00000 |   0.84825
# 1003 |   0.47582 |   0.00000 |   0.83560
# 1004 |   0.47530 |   0.00000 |   0.83366
# 1005 |   0.47396 |   0.00000 |   0.83852
# 1006 |   0.47526 |   0.00000 |   0.82879
# 1007 |   0.47325 |   0.00000 |   0.83755
# 1008 |   0.47301 |   0.00000 |   0.83949
# 1009 |   0.47212 |   0.00000 |   0.85117
# 1010 |   0.47124 |   0.00000 |   0.84728
# 1011 |   0.47168 |   0.00000 |   0.85700
# 1012 |   0.49259 |   0.00000 |   0.79864
# 1013 |   0.49331 |   0.00000 |   0.78891
# 1014 |   0.48602 |   0.00000 |   0.79280
# 1015 |   0.51709 |   0.00000 |   0.73444
# 1016 |   0.50436 |   0.00000 |   0.76070
# 1017 |   0.49104 |   0.00000 |   0.78016
# 1018 |   0.48956 |   0.00000 |   0.80253
# 1019 |   0.54473 |   0.00000 |   0.76459
# 1020 |   0.51186 |   0.00000 |   0.74514
In [40]:
len(game.memory)
Out[40]:
133

Let’s train best_player some more:

In [48]:
best_player.net.train(1000, report_rate=5, plot=True)
[training error plot]
Interrupted! Cleaning up...
========================================================================
       |  Training |    policy |     value
Epochs |     Error |  head acc |  head acc
------ | --------- | --------- | ---------
#  801 |   0.39601 |   0.00000 |   0.92708
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-48-5c805804ac93> in <module>()
      1 current_player.net.dataset.clear()
      2 current_player.net.dataset.load(game.memory)
----> 3 current_player.net.train(1000, report_rate=5, plot=True)

~/.local/lib/python3.6/site-packages/conx/network.py in train(self, epochs, accuracy, error, batch_size, report_rate, verbose, kverbose, shuffle, tolerance, class_weight, sample_weight, use_validation_to_stop, plot, record, callbacks, save)
   1262                 print("Saved!")
   1263         if interrupted:
-> 1264             raise KeyboardInterrupt
   1265         if verbose == 0:
   1266             return (self.epoch_count, result)

KeyboardInterrupt:
In [44]:
best_player.net["policy_head"].vshape = (6,7)
best_player.net.config["show_targets"] = True
In [45]:
best_player.net.dashboard()

Now, you can play against the best player to see how it does:

In [ ]:
p1 = QueryPlayer("Your Name")
p2 = NNPlayer("Trained AlphaZero")
p2.net = best_player.net
connect4 = ConnectFour()
connect4.play_game(p1, p2)

3.13.6. Summary

  • The network learns by playing against itself, so it always faces an opponent at just the right level, evolution-style.
  • Monte Carlo Tree Search is used during training to guide move selection and to generate the training targets.