3.11. Alice in Wonderland

This notebook demonstrates generating sequences using a Simple Recurrent Network (SimpleRNN).

For this example, we will use the unprocessed text from Lewis Carroll’s “Alice in Wonderland”. However, the sequence can really be anything, including code, music, or knitting instructions.

In [1]:
import conx as cx
Using TensorFlow backend.
Conx, version 3.6.1

First, we find a copy of Alice in Wonderland, download it, and read it in:

In [2]:
INPUT_FILE = "alice_in_wonderland.txt"
In [3]:
cx.download("http://www.gutenberg.org/files/11/11-0.txt", filename=INPUT_FILE)
Using cached http://www.gutenberg.org/files/11/11-0.txt as './alice_in_wonderland.txt'.
In [4]:
# extract the input as a stream of characters
lines = []
with open(INPUT_FILE, 'rb') as fp:
    for line in fp:
        line = line.strip().lower()
        line = line.decode("ascii", "ignore")
        if len(line) == 0:
            continue
        lines.append(line)
text = " ".join(lines)
lines = None # clean up memory

Next, we create some utility dictionaries for mapping the characters to indices and back:

In [5]:
chars = set([c for c in text])
nb_chars = len(chars)
char2index = dict((c, i) for i, c in enumerate(chars))
index2char = dict((i, c) for i, c in enumerate(chars))
In [6]:
nb_chars
Out[6]:
55

In this text, there are 55 different characters.

Each character has a unique mapping to an integer:

In [7]:
char2index["a"]
Out[7]:
24
In [8]:
index2char[5]
Out[8]:
']'

3.11.1. Build the Dataset

Next we build the dataset. We do this by stepping through the text one character at a time, building an input
sequence of length SEQLEN and its associated target character (the next character in the text).

For example, given an input text of “the sky was falling” and a SEQLEN of 10, we would get the following inputs and targets (note that the target of the second sequence is a space):

Inputs     -> Target
----------    ------
the sky wa -> s
he sky was ->
e sky was  -> f
 sky was f -> a
sky was fa -> l

How can we represent the characters? There are many ways, including using an EmbeddingLayer. In this example, we simply use a one-hot encoding of each character’s index. Note that the total length of the one-hot encoding is one more than the number of distinct characters, because we also reserve a position for the zero index.
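
To make the encoding concrete, here is a minimal pure-Python sketch of what a one-hot vector looks like (the onehot function below is only an illustration; the code that follows uses the cx.onehot utility):

def onehot(index, length):
    # Illustration only: a vector of zeros with a single 1 at position `index`
    vector = [0] * length
    vector[index] = 1
    return vector

onehot(2, 6)   # [0, 0, 1, 0, 0, 0]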

In [9]:
SEQLEN = 10
data = []
for i in range(0, len(text) - SEQLEN):
    inputs = [cx.onehot(char2index[char], nb_chars + 1) for char in text[i:i + SEQLEN]]
    targets = cx.onehot(char2index[text[i + SEQLEN]], nb_chars + 1)
    data.append([inputs, targets])
text = None # clean up memory
In [10]:
dataset = cx.Dataset()
dataset.load(data)
data = None # clean up memory; not needed
In [11]:
len(dataset)
Out[11]:
158773
In [12]:
cx.shape(dataset.inputs[0])
Out[12]:
(10, 56)

The shape of each input is 10 x 56: a sequence of 10 characters, each encoded as a one-hot vector of length 56 (the 55 characters plus the reserved zero position).

Let’s check the inputs and targets to make sure everything is encoded properly:

In [13]:
def onehot_to_char(vector):
    index = cx.argmax(vector)
    return index2char[index]
In [14]:
for i in range(10):
    print("".join([onehot_to_char(v) for v in dataset.inputs[i]]),
          "->",
          onehot_to_char(dataset.targets[i]))
project gu -> t
roject gut -> e
oject gute -> n
ject guten -> b
ect gutenb -> e
ct gutenbe -> r
t gutenber -> g
 gutenberg -> s
gutenbergs ->
utenbergs  -> a

Looks good!

3.11.2. Build the Network

We will use a single SimpleRNNLayer with a fully-connected output bank to compute the most likely predicted output character.

Note that we can use the categorical cross-entropy error function since we are using the “softmax” activation function on the output layer.

In this example, we unroll the recurrent computation over the input sequence, so the ten time steps are evaluated as an explicit feed-forward graph rather than a loop.

In [15]:
network = cx.Network("Alice in Wonderland")
network.add(
    cx.Layer("input", (SEQLEN, nb_chars + 1)),
    cx.SimpleRNNLayer("rnn", 128,
                      return_sequences=False,
                      unroll=True),
    cx.Layer("output", nb_chars + 1, activation="softmax"),
)
network.connect()
network.compile(error="categorical_crossentropy", optimizer="rmsprop")
In [16]:
network.set_dataset(dataset)
In [17]:
network.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input (InputLayer)           (None, 10, 56)            0
_________________________________________________________________
rnn (SimpleRNN)              (None, 128)               23680
_________________________________________________________________
output (Dense)               (None, 56)                7224
=================================================================
Total params: 30,904
Trainable params: 30,904
Non-trainable params: 0
_________________________________________________________________
In [18]:
network.dashboard()

3.11.3. Train the Network

After each training epoch, we will test the network by generating some sample text.

We could use cx.choice(p=output) or cx.argmax(output) for picking the next character. Which works best for you?

In [19]:
def generate_text(sequence, count):
    # Generate `count` characters, one at a time, feeding each prediction
    # back in as the newest element of the input sequence.
    for i in range(count):
        output = network.propagate(sequence)
        char = index2char[cx.argmax(output)]
        print(char, end="")
        # Slide the window: drop the oldest vector and append the
        # network's output distribution as the newest "character".
        sequence = sequence[1:] + [output]
    print()
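
For comparison, here is a sketch (not part of the original notebook) of a sampling-based variant that picks the next character with cx.choice(p=output) rather than cx.argmax(output), and feeds back a one-hot vector of the chosen character instead of the raw output distribution:

def generate_text_sampled(sequence, count):
    # Sketch: sample the next character from the output probabilities.
    for i in range(count):
        output = network.propagate(sequence)
        index = cx.choice(p=output)
        # The extra reserved position never occurs as a target; skip it if sampled.
        char = index2char.get(index, "")
        print(char, end="")
        # Feed back a one-hot vector of the chosen character.
        sequence = sequence[1:] + [cx.onehot(index, nb_chars + 1)]
    print()

Sampling tends to produce more varied output than argmax, which can get stuck repeating the most common patterns.
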
In [20]:
for iteration in range(25):
    print("=" * 50)
    print("Iteration #: %d" % (network.epoch_count))
    results = network.train(1, batch_size=128, plot=False, verbose=0)
    sequence = network.dataset.inputs[cx.choice(len(network.dataset))]
    print("Generating from seed: %s" % ("".join([onehot_to_char(v) for v in sequence])))
    generate_text(sequence, 100)
network.plot_results()
==================================================
Iteration #: 0
Generating from seed: in in the
kitt      t
==================================================
Iteration #: 1
Generating from seed: w computer
e  i t tt t tt t tt t tt t ttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
==================================================
Iteration #: 2
Generating from seed: ns! she at
 ee i  e e te
==================================================
Iteration #: 3
Generating from seed: es, but i
sont to t t tt t to t t  t t tt t t  t t tt t tt t t  t t tt t tt t t  t t tt t t  t t tt t tt t t
==================================================
Iteration #: 4
Generating from seed: it lasted.
 io e es i t    t e    e t                     e         e    e    e    e    e    e   ee   ee   ee
==================================================
Iteration #: 5
Generating from seed: wo-- why,
sa e aiee  to   ire   ot tore  oe   ore  ie  tor t ie  toe tire  ioe tire  iot tore  ie   or t ie  t
==================================================
Iteration #: 6
Generating from seed:  time it a
nd eostes  o se sese s st site si t si tes  t te t se to t si test tort sitt si t st s st s st tite
==================================================
Iteration #: 7
Generating from seed:  works bas
  one tore tor tore ior tore oor tore tor iore tor tore tor tore tor tore tor tore tor tore tor tore
==================================================
Iteration #: 8
Generating from seed: he helped
anries tit sirestt s totst ti  tor str s tor sttre tir stirestar  tires to  tires to stites tor sttr
==================================================
Iteration #: 9
Generating from seed: very tired
 alicerel s til e  o se oit tore oor tiaes torse oate oo etine  io ses ee  ar tie se  to slare to  l
==================================================
Iteration #: 10
Generating from seed: project gu
tenberg-ta tarites tt a ailis to slaitis or t airit to slares atts arirs ar iles tt a tiris tore  tt
==================================================
Iteration #: 11
Generating from seed: neck from
tiries airele aitiie tirisa e  iines  i s ares sire ain line tine tonele eseris  irm tone  on lorsle
==================================================
Iteration #: 12
Generating from seed: eople in a
 longe soit toree at t e eses rotel  se  or itee io teine to teine  a ter s aorses oo toriee ti ele
==================================================
Iteration #: 13
Generating from seed: inkling be
 ini r n ot  an late a s liti a t ne ala  ont rotils  tot i lali aont antire a a s attin  attiree to
==================================================
Iteration #: 14
Generating from seed: whether th
e e tors  on hore  on trene eat lare aone oon lone oon lire  an eroee or eerel oir line oorelare  an
==================================================
Iteration #: 15
Generating from seed:  said alic
e, alict iaial  fte y ni son  o teres  ar an ate l an  o aleitid   an l tinie  atnerar ai ror tins a
==================================================
Iteration #: 16
Generating from seed: elf safe i
n thaiteds  on hitiled ootiniri ass ione io toi ey armere  on betele  ont rone ain sares on thete in
==================================================
Iteration #: 17
Generating from seed: ad kept a
murses oi iny aone aotired on  rictire oonelion ior iite aitere  on aotesiog oo ely aibe aotelinu ao
==================================================
Iteration #: 18
Generating from seed: the thistl
enes arocesiry  or trtiry tone  on ireiler  or eraily tone sontered  ors tone rone oon ests ers cous
==================================================
Iteration #: 19
Generating from seed: d another
wontetin toree ont rethoree lonsini  or hitil  oor eone  on tiois ootely tone ros oolily oinil  ite
==================================================
Iteration #: 20
Generating from seed: nds and fe
et ani eriey tine ionteron lore ainee outine ainiee inner aniine ione ioner on  lootioneretne eriaes
==================================================
Iteration #: 21
Generating from seed: ative work
   onenenane  ontl oute itnelyin lire aone inite astins  ottestant ot elas tome sone ooterserision i
==================================================
Iteration #: 22
Generating from seed: nge tale,
autier ait litely sine ton h gesery  orerree oon  int rnttle  on tranese oo mirteron totily ait lins
==================================================
Iteration #: 23
Generating from seed: round, if
i mone tone sont ta elporsetan lols aimile on eset tors thre sone sone to toine  aiser aonel ou elin
==================================================
Iteration #: 24
Generating from seed: f, i wonde
r shane aape son lo erate san looklen  on erters wrnc sothin liousesotel sine oor hires at tame rine
[Plot: output of network.plot_results()]

What can you say about the text generated in later epochs compared to the earlier generated text?

This was about the simplest possible network architecture and parameter settings. Can you do better? Can you generate text that is better English, or even text that captures the style of Lewis Carroll?
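
For example, one possible starting point, sketched here but not run in this notebook, is to swap the SimpleRNNLayer for an LSTMLayer (assuming cx.LSTMLayer wraps Keras’ LSTM layer, analogous to how SimpleRNNLayer wraps SimpleRNN):

# A sketch of a possible variant: an LSTM often captures longer-range
# structure in text than a SimpleRNN of the same size.
network2 = cx.Network("Alice in Wonderland LSTM")
network2.add(
    cx.Layer("input", (SEQLEN, nb_chars + 1)),
    cx.LSTMLayer("lstm", 128,
                 return_sequences=False,
                 unroll=True),
    cx.Layer("output", nb_chars + 1, activation="softmax"),
)
network2.connect()
network2.compile(error="categorical_crossentropy", optimizer="rmsprop")
network2.set_dataset(dataset)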

Next, you might like to try this kind of experiment on your own sequential data.